This document contains a comprehensive workflow for the course Introduction to R: first steps, statistics, graphics, offered in the context of the CTH Course Series on Statistics - 2024.

The content will follow closely the slides used during the course itself, with particular highlighting regarding the code and its output. This should allow you to replicate all the steps, constituting a starting base for extending the commands e.g. to your own datasets.

Table of contents

  • Step 0: First things first: know your tools, R & RStudio
  • Step 1: Basics of R, RStudio, help, packages
  • Step 2: Data in, data out
  • Step 3: Analyzing (tabular) data: describe, explore, transform, summarise data
  • Step 4: Plotting data
  • Bonus Step: reproducible reports, i.e. escaping “the R script”

Schedule (tentatively)

9:00-10:00 — Step 0 & Bonus: How to make the most of this 10:00-12:30 — Step 1, 2, 3 (with short breaks in between)
12:30-13:30 — Lunch break
13:30-15:30 — Step 3, 4, Q&As

Step 0: You, R and RStudio, knowing you and knowing your tools

Let’s get to know each other a bit…

0.1 What is R? Why should I use R?

Available at https://www.r-project.org/

  • free statistical environment for interactive use
  • intepreted, functional scripting/programming language - you type in, you see output
  • descends from the S language, written by statisticians for statisticians

What can you do with R?

  • Anything!
    • do calculations
    • write functions
    • analyse data. ALL the data. Well, almost. But really, almost.
    • apply advanced statistical techniques
    • do beautiful & publication-ready plots
    • develop interactive web-applications
    • presentations & documents (this one)

Why should I use R?

  • it works! And it is quite powerful
  • it is free, open-source, and available for all OS
  • you can really do whatever you might aim to do in terms of statistics
  • it offers awesome possibilities for (interactive) graphing
  • it has a wide, active and competent community (ok, communities: statistics, bioinformatics, machine learning, …)
  • it can be extended with packages. More power!
  • escape Point-and-Click-land, you work with syntax: you can use, re-use elements, validate & reproduce analysis

Why should I not use R?

  • you use it or lose it
  • the learning curve might be steep
  • can be frustrating if you have errors
  • help might be available, but it is very technical
  • many packages, a blessing and a curse: how many, how good

0.2 Let’s get started!

Get R - and RStudio

Alternatives:

0.3 First ride: look around you

Open up RStudio - You’ll have four panes

  • the Source for your scripts and documents (top-left, in the default layout)
  • your Environment/History (top-right),
  • your Files/Plots/Packages/Help/Viewer (bottom-right), and
  • the R Console (bottom-left).

Want to customize this?

Tools \(\rightarrow\) Global Options \(\rightarrow\) Pane Layout

Advantages of an IDE

  • all in one window!
  • keyboard shortcuts, autocompletion, highlighting \(\rightarrow\) type easier, do less errors

0.4 Folder structure

It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the working directory.

Why?

  • Easier to have “self-contained” research units!
  • A project “does not interfere” with other projects
  • Gives a structure, easier to find things, use, reuse
  • Someone else (including future you) can understand what goes on

How?

RStudio projects!
Custom settings, per project.

Let’s create one live now - and have the workspace NOT saved

What is the best structure?

One, used consistently - not gonna touch on naming things as it can get hot quickly :)

With R…

dir.create()

file.edit()

Where am I doing things?

The working directory is the place from where R will be looking for and saving the files.

getwd()/setwd(), but not in your scripts (fails on others’ computers!)

0.5 Interacting with R

Instructions, commands.

Scripts, console - use the editor and have a complete record on what you did!

Shortcuts FTW!

Even better: Reproducible documents, with R Markdown

Nice resources on top: * RStudio cheatsheet about the RStudio IDE! * the internet/rstats community!

0.6 Seeking help

  • ?
  • ??
  • built-in RStudio help interface - and shortcuts!

0.6.1 Where to ask for help?

  • your neighbour - if COVID-conform, do interact within each other!
  • your colleagues
  • https://rdocumentation.org/ website
  • the web: google, StackOverflow

The main point: describe well your problem, “help others help you”

Others need to reproduce your error to help you better: saveRDS(), dput(), sessionInfo()

0.7 R packages

R packages…

  • are fundamental components of R ecosystem
  • extend base R functionality for a specific purpose
  • bundle new functions, data sets, and documentation
  • are contributed by independent developers
  • have dependency management

Repositories:

  • CRAN: Managed official package repository network
  • Bioconductor: curated bioinformatics packages (vignettes mandatory, integrated ecosystem!)
  • GitHub: un-managed, bleeding edge - but also excellent ones (“just not on CRAN”)

Our contributions so far:

  • flowcatchR https://bioconductor.org/packages/flowcatchR/
  • pcaExplorer https://bioconductor.org/packages/pcaExplorer/
  • ideal https://bioconductor.org/packages/ideal/
  • GeneTonic (https://bioconductor.org/packages/GeneTonic)
  • iSEE (https://bioconductor.org/packages/iSEE)
  • simrec (for simulating recurring events, https://cran.r-project.org/web/packages/simrec/)

0.7.1 How to use packages

  • Install: once (for every major R version)
  • Load: in each session
  • Use like any base R functionality

For Bioconductor packages…

install.packages("BiocManager")
library("BiocManager")
BiocManager::install()

Relevant commands:

  • install.packages("packagename") - check it online at CRAN!
  • installed.packages()
  • .libPaths()
  • update.packages()
  • library("packagename")
  • help(package="packagename"), data(), browseVignettes(), vignette(), citation("packagename")

Something you might have already done, if you worked on RNA-seq data:

BiocManager::install("SummarizedExperiment")
BiocManager::install("DESeq2")

0.8 How should I use this material?

Use this very same text document and expand this!

Why? Let’s have a look first into the Bonus Content, actually: we navigate to Section 5 and continue from there.

1 Step 1: Introduction to R

Here we will touch on the first commands in R, so that you can

  • Define the following terms as they relate to R: object, assign, call, function, arguments, options.
  • Assign values to objects in R.
  • Learn how to name objects
  • Use comments to inform script.
  • Solve simple arithmetic operations in R.
  • Call functions and use arguments to change their default options.
  • Inspect the content of vectors and manipulate their content.
  • Subset and extract values from vectors.
  • Analyze vectors with missing data.

1.1 R is a powerful calculator

… but not just that.

Type the following

2 + 2
# [1] 4
log(2)
# [1] 0.6931472
347 * 73841
# [1] 25622827
7 / 2
# [1] 3.5
7 %/% 2
# [1] 3
7 %% 2
# [1] 1

What did these commands do?

1.2 🎶 Help! 🎶

# this calls the help for a function to plot a histogram
?hist
# this is just the same
help(hist) ## what about ??
?apropos
apropos("row")
#  [1] ".row"                   ".rowMeans"              ".rowNamesDF<-"         
#  [4] ".rowSums"               "arrows"                 "browseEnv"             
#  [7] "browser"                "browserCondition"       "browserSetDebug"       
# [10] "browserText"            "browseURL"              "browseVignettes"       
# [13] "n2mfrow"                "nrow"                   "NROW"                  
# [16] "PlantGrowth"            "row"                    "row.names"             
# [19] "row.names.data.frame"   "row.names.default"      "row.names<-"           
# [22] "row.names<-.data.frame" "row.names<-.default"    "rowMeans"              
# [25] "rownames"               "rownames<-"             "rowsum"                
# [28] "rowsum.data.frame"      "rowsum.default"         "rowSums"               
# [31] "ToothGrowth"            "xpdrows.data.frame"
  • integrated help system, with executable examples
  • (for some packages) vignettes (typical problem, commands, and workflow)
  • CRAN Task Views: https://cran.r-project.org/web/views/
  • Books!
  • Courses!
  • Online: mailing lists, forums (StackOverflow, …), blogs, Twitter (#rstats)

1.3 Your starting vocabulary - a.k.a. Exercise Session 0

  • getwd() and setwd() - Tab is your friend
  • <-, =: the assignment operator
  • ls(), rm()
  • str()
  • example(), help()/?[function]
  • print()
  • q()/quit()
  • logical operators: TRUE,FALSE,!,==,!=,<,>,<=,>=,|,&,xor()
  • c()
  • data have help items too: e.g. cars

Find out what these do!

1.3.1 Exercise Session 0 - Solutions

Click on this to display the solution
?getwd
?setwd
?`<-`
help(ls)
help(rm)
?str
?example
help(help)
?print
help(quit)
?c
?cars

1.4 Make your life easier - Notes for your future self

  • add comments and document your own code
# This is a comment
  • write clean code - use spaces, indentation
  • use an editor with syntax highlighting/some form of autocompletion

Careful here:

  • R is case sensitive and has zero-tolerance with mis-spelled names
  • parenthesis: open and close them
  • special attention with missing values, factors VS strings: R is clever, but you might think differently
  • do not be stingy with parentheses - if this helps you
  • same goes with comments - your colleagues and your future self will thank you

1.5 Exercise session 1

First things first:

  • Grab some mini-postit (for those of you participating in person)!
  • Get ready to use the chat & the reactions

Then:

  • find out more about the iris dataset. What is it about at all? How many variables are included? How many observations?
  • replicate! find out a function that replicates elements of a vector to produce this
1 1 1 1

BONUS: … and this

1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

1.5.1 Exercise Session 1 - Solutions

Click on this to display the solution
rep(1, 4)
# [1] 1 1 1 1

rep(c(1, 2, 3, 4, 5), 3)
#  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

1.6 Data types

R can recognize different general types of data

  • numbers (numeric)

  • character strings (text)

  • logical (e.g. class(TRUE))

  • factors (“integers with a set of labels”) - it is categorical data!

  • special ones: dates, time, …

When and where to use which?

1.7 I was curious about the participants

In a previous edition of a similar course, I asked the future participants:

  • How old are you?
  • What is your current academic level? (PI, Postdoc, PhD, master)
  • What is your current knowledge level of R? (pro-good-intermediate-poor-none)
  • What is your knowledge of programming languages in general?
  • What is your experience level with genomics and RNA-seq data?
  • How familiar are you with mogon and parallel computing? (I am a regular user/Once in a while i used it/I know it exists/I heard we had some servers around/Is this supposed to be in the cloud?!)
  • What are your expectations from the course?

2 Step 2: Data in, data out

2.1 Importing data in R

80-20? 90-10? Import, clean, prepare, transform your data

Sources:

  • Files, Clipboard, URL
  • Plain text file: Comma-separated, tab-delimited, …
  • R format file
  • SAS / Stata / SPSS file: package haven
  • Spreadsheet (Excel): package readxl - highly recommended!
  • Database: RSQLite, RPostgreSQL, RMySQL, …

2.2 The vocabulary of importing

… and exporting

  • read.table(),write.table() + read.csv|delim
  • the option stringsAsFactors=FALSE
  • load(),save()/readRDS(),saveRDS()
  • via haven : read_sas(),read_spss() /write_sas(),write_sav()
  • via readxl: read_excel()

Check out their documentation pages!

Other options: rio, RStudio GUI

2.3 Take a look at the data

Go to https://github.com/federicomarini/rbioc2016

\(\rightarrow\) inst/extdata

\(\rightarrow\) survey_responses.csv, in its raw format

You can load it directly like this

surveyrbioc <- read.csv("https://raw.githubusercontent.com/federicomarini/rbioc2016/master/inst/extdata/survey_responses.csv")

Or install the package and load it from there

library("devtools")
install_github("federicomarini/rbioc2016")
library("rbioc2016")
data(surveyrbioc)

2.4 Input data: Step by step, by hand?

Sometimes your data is either small and/or not in an Excel-like tabular format.

What to do? You combine the elements together!

Q1 <- c(28, 27, 33, 32, 29)
# should return this
Q1
# [1] 28 27 33 32 29

Q2 <- c("PhD student", "PhD student", "Postdoc", "PhD student", "PhD student")
Q2
# [1] "PhD student" "PhD student" "Postdoc"     "PhD student" "PhD student"
# ... and so on

2.5 Combine the variables to a matrix

We have seen c(). We also have

  • cbind
  • rbind
firstTwo <- cbind(Q1, Q2)
firstTwo
#      Q1   Q2           
# [1,] "28" "PhD student"
# [2,] "27" "PhD student"
# [3,] "33" "Postdoc"    
# [4,] "32" "PhD student"
# [5,] "29" "PhD student"
rbind(Q1, Q2)
#    [,1]          [,2]          [,3]      [,4]          [,5]         
# Q1 "28"          "27"          "33"      "32"          "29"         
# Q2 "PhD student" "PhD student" "Postdoc" "PhD student" "PhD student"

Is this what you wanted?

2.6 Applying the first functions

But first, what can you do on these objects?

sum(Q1)
# [1] 149
sum(Q2)
# Error in sum(Q2): invalid 'type' (character) of argument
summary(Q1)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    27.0    28.0    29.0    29.8    32.0    33.0
summary(Q2)
#    Length     Class      Mode 
#         5 character character
str(Q1)
#  num [1:5] 28 27 33 32 29
str(Q2)
#  chr [1:5] "PhD student" "PhD student" "Postdoc" "PhD student" "PhD student"
mean(Q1)
# [1] 29.8
dim(firstTwo)
# [1] 5 2
firstTwo[, 1]
# [1] "28" "27" "33" "32" "29"
mean(firstTwo[, 1]) # Why, damn, why? Meet coercion
# [1] NA
class(firstTwo)
# [1] "matrix" "array"

2.7 matrix, data.frame and list

  • a matrix can contain one type of data - if numeric, you unleash all the matrix algebra power!
  • a data.frame can store more types of data (one per column)
  • a list is like a big box where you can put anything - but this is not always what you want

What is best?

Let’s try with a list

Q3 <- c("intermediate", "poor", "good", "none", "intermediate")
mylist <- list(Q1, Q2, Q3)
mylist
# [[1]]
# [1] 28 27 33 32 29
# 
# [[2]]
# [1] "PhD student" "PhD student" "Postdoc"     "PhD student" "PhD student"
# 
# [[3]]
# [1] "intermediate" "poor"         "good"         "none"         "intermediate"
## access your elements with
mylist[[1]]
# [1] 28 27 33 32 29
mylist[[1]][2]
# [1] 27

How do we create a data.frame?

mydf <- data.frame(
  age = Q1,
  level = Q2,
  rexp = Q3
)
mydf
#   age       level         rexp
# 1  28 PhD student intermediate
# 2  27 PhD student         poor
# 3  33     Postdoc         good
# 4  32 PhD student         none
# 5  29 PhD student intermediate
class(mydf$age)
# [1] "numeric"

2.7.1 Exploring a data.frame

mydf$age # it's all about the money :)
# [1] 28 27 33 32 29
mydf[, 1]
# [1] 28 27 33 32 29
names(mydf)
# [1] "age"   "level" "rexp"
rownames(mydf)
# [1] "1" "2" "3" "4" "5"
dim(mydf)
# [1] 5 3
nrow(mydf)
# [1] 5
ncol(mydf)
# [1] 3
surveyrbioc <- read.csv("https://raw.githubusercontent.com/federicomarini/rbioc2016/master/inst/extdata/survey_responses.csv")
head(surveyrbioc)
#   Q1          Q2           Q3           Q4           Q5                                    Q6
# 1 28 PhD student intermediate intermediate         good                   I am a regular user
# 2 27 PhD student         poor intermediate         poor                      I know it exists
# 3 33     Postdoc         good         good         good                      I know it exists
# 4 32 PhD student         poor         poor         poor Is this supposed to be in the cloud?!
# 5 29 PhD student         none         poor         poor                      I know it exists
# 6 33     Postdoc intermediate intermediate intermediate             Once in a while i used it
#                                                                                                                                                                                                                                               Q7
# 1                                                                                                                                                                                                Learn Parallelization with R (and Bioconductor)
# 2                                                                                                                                                                                                                                           <NA>
# 3                                                                                                                                                                                  use R scripts in parallel context (ex: alignments in RNA-seq)
# 4                                                                Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would be interesting to get to now how to find putative novel miRNAs.
# 5                                                                                                                                                                                                 Possible I will use R for editing RNA-seq data
# 6 I would describe myself as an advanced beginner, I am actively using R now to look (mostly) at count tables from NGS data and make plots. I am especially interested in the Bioconductor and parallelization in R section, this is new for me.
tail(surveyrbioc)
#    Q1                  Q2           Q3           Q4           Q5
# 17 28 Master student/else intermediate intermediate intermediate
# 18 39             Postdoc         good intermediate         good
# 19 32 Master student/else         none         poor    genoWhat?
# 20 29         PhD student         poor         none         poor
# 21 31         PhD student         none         none         good
# 22 30         PhD student         none         none         good
#                                       Q6                                                         Q7
# 17                      I know it exists       better understanding of R and to extend my knowledge
# 18                   I am a regular user            Mainly interested in parallel computing options
# 19 Is this supposed to be in the cloud?!                                                       <NA>
# 20 Is this supposed to be in the cloud?!          To better understand R and perform basic analysis
# 21    I heard we had some servers around                        working with fastq files based on R
# 22 Is this supposed to be in the cloud?! I want to be able to analyze my sequence data on my own :P
names(surveyrbioc)
# [1] "Q1" "Q2" "Q3" "Q4" "Q5" "Q6" "Q7"
str(surveyrbioc)
# 'data.frame': 22 obs. of  7 variables:
#  $ Q1: int  28 27 33 32 29 33 40 23 27 23 ...
#  $ Q2: chr  "PhD student" "PhD student" "Postdoc" "PhD student" ...
#  $ Q3: chr  "intermediate" "poor" "good" "poor" ...
#  $ Q4: chr  "intermediate" "intermediate" "good" "poor" ...
#  $ Q5: chr  "good" "poor" "good" "poor" ...
#  $ Q6: chr  "I am a regular user" "I know it exists" "I know it exists" "Is this supposed to be in the cloud?!" ...
#  $ Q7: chr  "Learn Parallelization with R (and Bioconductor)" NA "use R scripts in parallel context (ex: alignments in RNA-seq)" "Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would"| __truncated__ ...
summary(surveyrbioc)
#        Q1             Q2                 Q3                 Q4                 Q5           
#  Min.   :23.00   Length:22          Length:22          Length:22          Length:22         
#  1st Qu.:27.25   Class :character   Class :character   Class :character   Class :character  
#  Median :29.50   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#  Mean   :30.14                                                                              
#  3rd Qu.:32.75                                                                              
#  Max.   :40.00                                                                              
#       Q6                 Q7           
#  Length:22          Length:22         
#  Class :character   Class :character  
#  Mode  :character   Mode  :character  
#                                       
#                                       
# 
surveyrbioc[, ]
#    Q1                  Q2           Q3           Q4           Q5
# 1  28         PhD student intermediate intermediate         good
# 2  27         PhD student         poor intermediate         poor
# 3  33             Postdoc         good         good         good
# 4  32         PhD student         poor         poor         poor
# 5  29         PhD student         none         poor         poor
# 6  33             Postdoc intermediate intermediate intermediate
# 7  40             Postdoc         good         good intermediate
# 8  23 Master student/else         poor         poor         good
# 9  27         PhD student         none         poor intermediate
# 10 23 Master student/else         poor         poor         poor
# 11 35             Postdoc         poor         poor intermediate
# 12 34 Master student/else         poor          pro         poor
# 13 31             Postdoc         none         none intermediate
# 14 27         PhD student         none         poor         poor
# 15 24 Master student/else         poor intermediate intermediate
# 16 28         PhD student         none         poor         poor
# 17 28 Master student/else intermediate intermediate intermediate
# 18 39             Postdoc         good intermediate         good
# 19 32 Master student/else         none         poor    genoWhat?
# 20 29         PhD student         poor         none         poor
# 21 31         PhD student         none         none         good
# 22 30         PhD student         none         none         good
#                                       Q6
# 1                    I am a regular user
# 2                       I know it exists
# 3                       I know it exists
# 4  Is this supposed to be in the cloud?!
# 5                       I know it exists
# 6              Once in a while i used it
# 7                    I am a regular user
# 8                       I know it exists
# 9                       I know it exists
# 10 Is this supposed to be in the cloud?!
# 11                      I know it exists
# 12                      I know it exists
# 13 Is this supposed to be in the cloud?!
# 14                      I know it exists
# 15                      I know it exists
# 16                      I know it exists
# 17                      I know it exists
# 18                   I am a regular user
# 19 Is this supposed to be in the cloud?!
# 20 Is this supposed to be in the cloud?!
# 21    I heard we had some servers around
# 22 Is this supposed to be in the cloud?!
#                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Q7
# 1                                                                                                                                                                                                                                                                                                                                                                                                                                           Learn Parallelization with R (and Bioconductor)
# 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      <NA>
# 3                                                                                                                                                                                                                                                                                                                                                                                                                             use R scripts in parallel context (ex: alignments in RNA-seq)
# 4                                                                                                                                                                                                                                                                                                           Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would be interesting to get to now how to find putative novel miRNAs.
# 5                                                                                                                                                                                                                                                                                                                                                                                                                                            Possible I will use R for editing RNA-seq data
# 6                                                                                                                                                                                                                                            I would describe myself as an advanced beginner, I am actively using R now to look (mostly) at count tables from NGS data and make plots. I am especially interested in the Bioconductor and parallelization in R section, this is new for me.
# 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      <NA>
# 8                                                                                                                                                                                                                                                                                                                                                                                                                                                      Get to know R better, basic commands
# 9                                                                                                                                                                                                                                                                                                                                                                                                                             To look if I can process my sequencing data on my own using R
# 10 I would like to learn more about R, especially how to use it in biomedial research. Im in the 2nd Semester of the Rasterprogramm biomedicine, last Semester I had two weeks bioinformatics and we used to work with R for statistics/ChIP-Seq/microarray-data analysis, but the time was too short, to go deep into it, we just scratched the surface. So now I wish to learn some more, would be great, if I could work with the programme by myself, for example for the masterthesis.
# 11                                                                                                                                                                                                                                                                                                                                                                                                                                                                       learn more about R
# 12                                                                                                                                                                                                                                                                                                                                                                                               Brush up on some R knowledge and maybe get some different perspective on Genome processing
# 13                                                                                                                                                                                                                                                                                                                                                                                                      to get a good and understandable introduction into R programming and bioinformatics
# 14                                                                                                                                                                                                                                                                                                                                                                                                                                                              learning how to work with R
# 15                                                                                                                                                                                                                                                                                                                                                                                                                                                     To learn the basics of R programming
# 16                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     <NA>
# 17                                                                                                                                                                                                                                                                                                                                                                                                                                     better understanding of R and to extend my knowledge
# 18                                                                                                                                                                                                                                                                                                                                                                                                                                          Mainly interested in parallel computing options
# 19                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     <NA>
# 20                                                                                                                                                                                                                                                                                                                                                                                                                                        To better understand R and perform basic analysis
# 21                                                                                                                                                                                                                                                                                                                                                                                                                                                      working with fastq files based on R
# 22                                                                                                                                                                                                                                                                                                                                                                                                                               I want to be able to analyze my sequence data on my own :P

Don’t forget some nice built-in like the View() function, in RStudio - this is a nice alternative to “pure browsing in Excel”.

2.8 Exercise session 2

Using the surveyrbioc object:

  • Calculate the mean age of the participants
  • How many participants did actually take part to the survey?
  • How old was the oldest participant? (max can be your help)
  • transpose the survey data and assign it to another variable
  • Change the column names of this object and save this data set as a tab-separated ASCII file
  • BONUS: what was the youngest participant expecting?

2.8.1 Exercise Session 2 - Solutions

Click on this to display the solution
mean(surveyrbioc$Q1)
# [1] 30.13636

max(surveyrbioc$Q1)
# [1] 40

my_transposed_survey <- t(surveyrbioc)

surveyrbioc_mod <- surveyrbioc
colnames(surveyrbioc_mod) <- c("age", "level", "rlevel", "prog_level", "genomics_level", "parcomp_level", "expectation")

surveyrbioc_mod$expectation[which.min(surveyrbioc_mod$age)]
# [1] "Get to know R better, basic commands"

3 Step 3: Analyzing (tabular) data

Describe, explore, transform, summarise data

3.1 Exploring, subsetting, manipulating, analysing

  • dim(x) shows the dimensions of an object
  • str(x) provides an overview of the structure of an object and the elements it contains
  • sum(x), mean(x), sd(x) computes the sum, mean, or standard deviation of all the elements in x; median(x), quantile(x)
  • length(x) returns the number of elements in x (a vector)
  • sqrt(x), log(x) take the square root and the natural logarithm of a numeric - element or vector
  • hist(x, breaks=20, col="blue") plots a histogram of variable x with 20 bins colored blue
  • unique(x) returns the vector of unique elements in x
  • rm(x) removes the object x from the environment (rm(list=ls()) removes all objects)
  • sessionInfo() prints information about R session and versions of all attached packages
  • logical operators might often come handy!

3.2 Subsetting the data

This is the basic way it works

surveyrbioc[ROWS, COLUMNS]

You can subset with…

  • integers
  • blank spaces
  • names
  • logical vectors

Try to make a guess, given this vector.

vec <- c(6, 1, 3, 6, 10, 5)

What happens if you do this?

vec[2]
# [1] 1
vec[c(5, 6)]
# [1] 10  5
vec[-c(5, 6)]
# [1] 6 1 3 6
vec > 5
# [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
vec[vec > 5]
# [1]  6  6 10

What happens if you do this?

df <- data.frame(
  name = c("John", "Paul", "George", "Ringo"),
  birth = c(1940, 1942, 1943, 1940),
  instrument = c("guitar", "bass", "guitar", "drums")
)

df
#     name birth instrument
# 1   John  1940     guitar
# 2   Paul  1942       bass
# 3 George  1943     guitar
# 4  Ringo  1940      drums

df[c(2, 4), 3]
# [1] "bass"  "drums"
df[, 1]
# [1] "John"   "Paul"   "George" "Ringo"
df[, "instrument"]
# [1] "guitar" "bass"   "guitar" "drums"
df$instrument
# [1] "guitar" "bass"   "guitar" "drums"

Back to the survey

# I just want the age
surveyrbioc[, 1]
#  [1] 28 27 33 32 29 33 40 23 27 23 35 34 31 27 24 28 28 39 32 29 31 30
# or
surveyrbioc$Q1
#  [1] 28 27 33 32 29 33 40 23 27 23 35 34 31 27 24 28 28 39 32 29 31 30

# the first 4 columns
surveyrbioc[, c(1, 2, 3, 4)]
#    Q1                  Q2           Q3           Q4
# 1  28         PhD student intermediate intermediate
# 2  27         PhD student         poor intermediate
# 3  33             Postdoc         good         good
# 4  32         PhD student         poor         poor
# 5  29         PhD student         none         poor
# 6  33             Postdoc intermediate intermediate
# 7  40             Postdoc         good         good
# 8  23 Master student/else         poor         poor
# 9  27         PhD student         none         poor
# 10 23 Master student/else         poor         poor
# 11 35             Postdoc         poor         poor
# 12 34 Master student/else         poor          pro
# 13 31             Postdoc         none         none
# 14 27         PhD student         none         poor
# 15 24 Master student/else         poor intermediate
# 16 28         PhD student         none         poor
# 17 28 Master student/else intermediate intermediate
# 18 39             Postdoc         good intermediate
# 19 32 Master student/else         none         poor
# 20 29         PhD student         poor         none
# 21 31         PhD student         none         none
# 22 30         PhD student         none         none
surveyrbioc[, 1:4]
#    Q1                  Q2           Q3           Q4
# 1  28         PhD student intermediate intermediate
# 2  27         PhD student         poor intermediate
# 3  33             Postdoc         good         good
# 4  32         PhD student         poor         poor
# 5  29         PhD student         none         poor
# 6  33             Postdoc intermediate intermediate
# 7  40             Postdoc         good         good
# 8  23 Master student/else         poor         poor
# 9  27         PhD student         none         poor
# 10 23 Master student/else         poor         poor
# 11 35             Postdoc         poor         poor
# 12 34 Master student/else         poor          pro
# 13 31             Postdoc         none         none
# 14 27         PhD student         none         poor
# 15 24 Master student/else         poor intermediate
# 16 28         PhD student         none         poor
# 17 28 Master student/else intermediate intermediate
# 18 39             Postdoc         good intermediate
# 19 32 Master student/else         none         poor
# 20 29         PhD student         poor         none
# 21 31         PhD student         none         none
# 22 30         PhD student         none         none

# all but the last column
surveyrbioc[, -7]
#    Q1                  Q2           Q3           Q4           Q5
# 1  28         PhD student intermediate intermediate         good
# 2  27         PhD student         poor intermediate         poor
# 3  33             Postdoc         good         good         good
# 4  32         PhD student         poor         poor         poor
# 5  29         PhD student         none         poor         poor
# 6  33             Postdoc intermediate intermediate intermediate
# 7  40             Postdoc         good         good intermediate
# 8  23 Master student/else         poor         poor         good
# 9  27         PhD student         none         poor intermediate
# 10 23 Master student/else         poor         poor         poor
# 11 35             Postdoc         poor         poor intermediate
# 12 34 Master student/else         poor          pro         poor
# 13 31             Postdoc         none         none intermediate
# 14 27         PhD student         none         poor         poor
# 15 24 Master student/else         poor intermediate intermediate
# 16 28         PhD student         none         poor         poor
# 17 28 Master student/else intermediate intermediate intermediate
# 18 39             Postdoc         good intermediate         good
# 19 32 Master student/else         none         poor    genoWhat?
# 20 29         PhD student         poor         none         poor
# 21 31         PhD student         none         none         good
# 22 30         PhD student         none         none         good
#                                       Q6
# 1                    I am a regular user
# 2                       I know it exists
# 3                       I know it exists
# 4  Is this supposed to be in the cloud?!
# 5                       I know it exists
# 6              Once in a while i used it
# 7                    I am a regular user
# 8                       I know it exists
# 9                       I know it exists
# 10 Is this supposed to be in the cloud?!
# 11                      I know it exists
# 12                      I know it exists
# 13 Is this supposed to be in the cloud?!
# 14                      I know it exists
# 15                      I know it exists
# 16                      I know it exists
# 17                      I know it exists
# 18                   I am a regular user
# 19 Is this supposed to be in the cloud?!
# 20 Is this supposed to be in the cloud?!
# 21    I heard we had some servers around
# 22 Is this supposed to be in the cloud?!
# if you don't know we had 7 columns...
surveyrbioc[, -ncol(surveyrbioc)]
#    Q1                  Q2           Q3           Q4           Q5
# 1  28         PhD student intermediate intermediate         good
# 2  27         PhD student         poor intermediate         poor
# 3  33             Postdoc         good         good         good
# 4  32         PhD student         poor         poor         poor
# 5  29         PhD student         none         poor         poor
# 6  33             Postdoc intermediate intermediate intermediate
# 7  40             Postdoc         good         good intermediate
# 8  23 Master student/else         poor         poor         good
# 9  27         PhD student         none         poor intermediate
# 10 23 Master student/else         poor         poor         poor
# 11 35             Postdoc         poor         poor intermediate
# 12 34 Master student/else         poor          pro         poor
# 13 31             Postdoc         none         none intermediate
# 14 27         PhD student         none         poor         poor
# 15 24 Master student/else         poor intermediate intermediate
# 16 28         PhD student         none         poor         poor
# 17 28 Master student/else intermediate intermediate intermediate
# 18 39             Postdoc         good intermediate         good
# 19 32 Master student/else         none         poor    genoWhat?
# 20 29         PhD student         poor         none         poor
# 21 31         PhD student         none         none         good
# 22 30         PhD student         none         none         good
#                                       Q6
# 1                    I am a regular user
# 2                       I know it exists
# 3                       I know it exists
# 4  Is this supposed to be in the cloud?!
# 5                       I know it exists
# 6              Once in a while i used it
# 7                    I am a regular user
# 8                       I know it exists
# 9                       I know it exists
# 10 Is this supposed to be in the cloud?!
# 11                      I know it exists
# 12                      I know it exists
# 13 Is this supposed to be in the cloud?!
# 14                      I know it exists
# 15                      I know it exists
# 16                      I know it exists
# 17                      I know it exists
# 18                   I am a regular user
# 19 Is this supposed to be in the cloud?!
# 20 Is this supposed to be in the cloud?!
# 21    I heard we had some servers around
# 22 Is this supposed to be in the cloud?!

# you can subset with logical vectors, by row and by column
surveyrbioc[c(rep(TRUE, 10), rep(FALSE, 8)), ]
#    Q1                  Q2           Q3           Q4           Q5
# 1  28         PhD student intermediate intermediate         good
# 2  27         PhD student         poor intermediate         poor
# 3  33             Postdoc         good         good         good
# 4  32         PhD student         poor         poor         poor
# 5  29         PhD student         none         poor         poor
# 6  33             Postdoc intermediate intermediate intermediate
# 7  40             Postdoc         good         good intermediate
# 8  23 Master student/else         poor         poor         good
# 9  27         PhD student         none         poor intermediate
# 10 23 Master student/else         poor         poor         poor
# 19 32 Master student/else         none         poor    genoWhat?
# 20 29         PhD student         poor         none         poor
# 21 31         PhD student         none         none         good
# 22 30         PhD student         none         none         good
#                                       Q6
# 1                    I am a regular user
# 2                       I know it exists
# 3                       I know it exists
# 4  Is this supposed to be in the cloud?!
# 5                       I know it exists
# 6              Once in a while i used it
# 7                    I am a regular user
# 8                       I know it exists
# 9                       I know it exists
# 10 Is this supposed to be in the cloud?!
# 19 Is this supposed to be in the cloud?!
# 20 Is this supposed to be in the cloud?!
# 21    I heard we had some servers around
# 22 Is this supposed to be in the cloud?!
#                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Q7
# 1                                                                                                                                                                                                                                                                                                                                                                                                                                           Learn Parallelization with R (and Bioconductor)
# 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      <NA>
# 3                                                                                                                                                                                                                                                                                                                                                                                                                             use R scripts in parallel context (ex: alignments in RNA-seq)
# 4                                                                                                                                                                                                                                                                                                           Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would be interesting to get to now how to find putative novel miRNAs.
# 5                                                                                                                                                                                                                                                                                                                                                                                                                                            Possible I will use R for editing RNA-seq data
# 6                                                                                                                                                                                                                                            I would describe myself as an advanced beginner, I am actively using R now to look (mostly) at count tables from NGS data and make plots. I am especially interested in the Bioconductor and parallelization in R section, this is new for me.
# 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      <NA>
# 8                                                                                                                                                                                                                                                                                                                                                                                                                                                      Get to know R better, basic commands
# 9                                                                                                                                                                                                                                                                                                                                                                                                                             To look if I can process my sequencing data on my own using R
# 10 I would like to learn more about R, especially how to use it in biomedial research. Im in the 2nd Semester of the Rasterprogramm biomedicine, last Semester I had two weeks bioinformatics and we used to work with R for statistics/ChIP-Seq/microarray-data analysis, but the time was too short, to go deep into it, we just scratched the surface. So now I wish to learn some more, would be great, if I could work with the programme by myself, for example for the masterthesis.
# 19                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     <NA>
# 20                                                                                                                                                                                                                                                                                                                                                                                                                                        To better understand R and perform basic analysis
# 21                                                                                                                                                                                                                                                                                                                                                                                                                                                      working with fastq files based on R
# 22                                                                                                                                                                                                                                                                                                                                                                                                                               I want to be able to analyze my sequence data on my own :P
surveyrbioc[c(TRUE, FALSE), ] # keep in mind this behavior!
#    Q1                  Q2           Q3           Q4           Q5
# 1  28         PhD student intermediate intermediate         good
# 3  33             Postdoc         good         good         good
# 5  29         PhD student         none         poor         poor
# 7  40             Postdoc         good         good intermediate
# 9  27         PhD student         none         poor intermediate
# 11 35             Postdoc         poor         poor intermediate
# 13 31             Postdoc         none         none intermediate
# 15 24 Master student/else         poor intermediate intermediate
# 17 28 Master student/else intermediate intermediate intermediate
# 19 32 Master student/else         none         poor    genoWhat?
# 21 31         PhD student         none         none         good
#                                       Q6
# 1                    I am a regular user
# 3                       I know it exists
# 5                       I know it exists
# 7                    I am a regular user
# 9                       I know it exists
# 11                      I know it exists
# 13 Is this supposed to be in the cloud?!
# 15                      I know it exists
# 17                      I know it exists
# 19 Is this supposed to be in the cloud?!
# 21    I heard we had some servers around
#                                                                                     Q7
# 1                                      Learn Parallelization with R (and Bioconductor)
# 3                        use R scripts in parallel context (ex: alignments in RNA-seq)
# 5                                       Possible I will use R for editing RNA-seq data
# 7                                                                                 <NA>
# 9                        To look if I can process my sequencing data on my own using R
# 11                                                                  learn more about R
# 13 to get a good and understandable introduction into R programming and bioinformatics
# 15                                                To learn the basics of R programming
# 17                                better understanding of R and to extend my knowledge
# 19                                                                                <NA>
# 21                                                 working with fastq files based on R

# guess what this does?
surveyrbioc$Q2 == "PhD student"
#  [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
# [17] FALSE FALSE FALSE  TRUE  TRUE  TRUE

3.3 Exercise session 3

  • How many PhD students did reply?
  • What was the proportion of PhD students to all other participants?
  • How old were they on average?
  • How many of the participants were older than 30?
  • How many postdocs were younger than 35?
  • How many of the participants did not reply to the last question?

3.3.1 Exercise Session 3 - Solutions

Click on this to display the solution
sum(surveyrbioc$Q2 == "PhD student")
# [1] 10
mean(surveyrbioc$Q2 == "PhD student")
# [1] 0.4545455
mean(surveyrbioc$Q1[surveyrbioc$Q2 == "PhD student"])
# [1] 28.8
sum(surveyrbioc$Q1 >= 30)
# [1] 11
sum(surveyrbioc$Q1 < 35 & surveyrbioc$Q2 == "Postdoc")
# [1] 3
sum(is.na(surveyrbioc$Q7))
# [1] 4

3.4 Manipulating and analysing your data

You can

  • sort the data (see sort and order)
  • transform your data: apply rules (formulas, logics, insight altogether)
  • combine two datasets or more (if you merge them)
  • do some statistics on your data

3.5 Sorting the data

myord <- order(surveyrbioc$Q1)
myord
#  [1]  8 10 15  2  9 14  1 16 17  5 20 22 13 21  4 19  3  6 12 11 18  7

head(surveyrbioc[myord, 1:5], 4)
#    Q1                  Q2   Q3           Q4           Q5
# 8  23 Master student/else poor         poor         good
# 10 23 Master student/else poor         poor         poor
# 15 24 Master student/else poor intermediate intermediate
# 2  27         PhD student poor intermediate         poor
sorted_surv <- surveyrbioc[myord, 1:6]

sort() returns you the sorted data, order() the indices only

3.6 Transforming the data

# transforming a variable
newsurvey <- surveyrbioc[, 1:5]
newsurvey$ageroot <- sqrt(newsurvey$Q1)
head(newsurvey)
#   Q1          Q2           Q3           Q4           Q5  ageroot
# 1 28 PhD student intermediate intermediate         good 5.291503
# 2 27 PhD student         poor intermediate         poor 5.196152
# 3 33     Postdoc         good         good         good 5.744563
# 4 32 PhD student         poor         poor         poor 5.656854
# 5 29 PhD student         none         poor         poor 5.385165
# 6 33     Postdoc intermediate intermediate intermediate 5.744563

# creating groups out of a continuous variable
newsurvey$agegroup <- cut(newsurvey$Q1, breaks = c(20, 30, 40))
head(newsurvey)
#   Q1          Q2           Q3           Q4           Q5  ageroot agegroup
# 1 28 PhD student intermediate intermediate         good 5.291503  (20,30]
# 2 27 PhD student         poor intermediate         poor 5.196152  (20,30]
# 3 33     Postdoc         good         good         good 5.744563  (30,40]
# 4 32 PhD student         poor         poor         poor 5.656854  (30,40]
# 5 29 PhD student         none         poor         poor 5.385165  (20,30]
# 6 33     Postdoc intermediate intermediate intermediate 5.744563  (30,40]

Use case for merge: you have two sets you are playing with! Think in advance what you need for that purpose…

3.7 We want statistics!

Are PhD students significantly younger than postdocs? Are there any differences in the age of the three groups?

phds <- surveyrbioc[surveyrbioc$Q2 == "PhD student", ]
postdocs <- surveyrbioc[surveyrbioc$Q2 == "Postdoc", ]
t.test(phds$Q1, postdocs$Q1)
# 
#   Welch Two Sample t-test
# 
# data:  phds$Q1 and postdocs$Q1
# t = -4.0528, df = 6.4476, p-value = 0.005767
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -10.146796  -2.586537
# sample estimates:
# mean of x mean of y 
#  28.80000  35.16667
aov(data = surveyrbioc, Q1 ~ Q2) # What is missing here?
# Call:
#    aov(formula = Q1 ~ Q2, data = surveyrbioc)
# 
# Terms:
#                       Q2 Residuals
# Sum of Squares  216.8242  207.7667
# Deg. of Freedom        2        19
# 
# Residual standard error: 3.306824
# Estimated effects may be unbalanced

Much more on this: in the next courses!

3.8 Simple yet powerful functions

tapply

You want to calculate the median age of each academic group in here

md <- median(surveyrbioc$Q1)
md_master <- median(surveyrbioc$Q1[surveyrbioc$Q2 == "Master student/else"])
md_phd <- median(surveyrbioc$Q1[surveyrbioc$Q2 == "PhD student"])
md_postdocs <- median(surveyrbioc$Q1[surveyrbioc$Q2 == "Postdoc"])
c(md_master, md_phd, md_postdocs)
# [1] 26.0 28.5 34.0

tapply splits the data of the first variable on the levels of the second variable, and applies the function (any function)

tapply(X = surveyrbioc$Q1, INDEX = surveyrbioc$Q2, FUN = median)
# Master student/else         PhD student             Postdoc 
#                26.0                28.5                34.0

lapply and sapply

Back to our iris dataset

names(iris)
# [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

We want the average sepal length and width, and the same for the petals. Uh, and we want the standard deviation too.

# the unefficient way:
seplen_m <- mean(iris$Sepal.Length)
sepwid_m <- mean(iris$Sepal.Width)
petlen_m <- mean(iris$Petal.Length)
petwid_m <- mean(iris$Petal.Width)

seplen_m <- sd(iris$Sepal.Length)
# ... and so on

\(\rightarrow\) Apply a Function over a List or Vector

# we will use just the first four columns
lapply(iris[, 1:4], mean)
# $Sepal.Length
# [1] 5.843333
# 
# $Sepal.Width
# [1] 3.057333
# 
# $Petal.Length
# [1] 3.758
# 
# $Petal.Width
# [1] 1.199333
sapply(iris[, 1:4], mean)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#     5.843333     3.057333     3.758000     1.199333
lapply(iris[, 1:4], sd)
# $Sepal.Length
# [1] 0.8280661
# 
# $Sepal.Width
# [1] 0.4358663
# 
# $Petal.Length
# [1] 1.765298
# 
# $Petal.Width
# [1] 0.7622377
# ...

The major difference is in the presentation of the output

3.9 A very good friend: get a summary of your data

and some other friends…

Try out summary on a data.frame

summary(iris)
#   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
#  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
#  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
#  Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
#  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
#  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
#  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Alternatives in other packages:

  • describe() in the Hmisc package
  • skim() from skimr
  • create_report() from Data Explorer

table

table(surveyrbioc$Q3)
# 
#         good intermediate         none         poor 
#            3            3            8            8
table(surveyrbioc$Q4)
# 
#         good intermediate         none         poor          pro 
#            2            6            4            9            1

table(surveyrbioc$Q2, surveyrbioc$Q3)
#                      
#                       good intermediate none poor
#   Master student/else    0            1    1    4
#   PhD student            0            1    6    3
#   Postdoc                3            1    1    1
  • want the sums? Try addmargins()
  • looking for the percentage values? prop.table()
  • somewhat nicer output: ftable()
addmargins(table(surveyrbioc$Q2, surveyrbioc$Q3))
#                      
#                       good intermediate none poor Sum
#   Master student/else    0            1    1    4   6
#   PhD student            0            1    6    3  10
#   Postdoc                3            1    1    1   6
#   Sum                    3            3    8    8  22
prop.table(table(surveyrbioc$Q2, surveyrbioc$Q3))
#                      
#                             good intermediate       none       poor
#   Master student/else 0.00000000   0.04545455 0.04545455 0.18181818
#   PhD student         0.00000000   0.04545455 0.27272727 0.13636364
#   Postdoc             0.13636364   0.04545455 0.04545455 0.04545455

Please always do check the docs!

3.10 Exercise session 4

The MASS package contains the dataset Cars93, which stores the data on 93 makes of car sold in US

  • you’ll need the package and the data
  • Type specifies the type of market the car is aimed at. Find the cheapest car in each type, and the one with the greatest fuel efficiency
  • compute the mean horsepower for each type
  • create two data.frames, one for US cars, the other one with non-US cars
  • export the US cars to a text file
  • save the non-US cars data to a binary file (.RData)

3.10.1 Exercise Session 4 - Solutions

Click on this to display the solution
library(MASS)
head(Cars93)
#   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway            AirBags
# 1        Acura Integra   Small      12.9  15.9      18.8       25          31               None
# 2        Acura  Legend Midsize      29.2  33.9      38.7       18          25 Driver & Passenger
# 3         Audi      90 Compact      25.9  29.1      32.3       20          26        Driver only
# 4         Audi     100 Midsize      30.8  37.7      44.6       19          26 Driver & Passenger
# 5          BMW    535i Midsize      23.7  30.0      36.2       22          30        Driver only
# 6        Buick Century Midsize      14.2  15.7      17.3       22          31        Driver only
#   DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
# 1      Front         4        1.8        140 6300         2890             Yes               13.2
# 2      Front         6        3.2        200 5500         2335             Yes               18.0
# 3      Front         6        2.8        172 5500         2280             Yes               16.9
# 4      Front         6        2.8        172 5500         2535             Yes               21.1
# 5       Rear         4        3.5        208 5700         2545             Yes               21.1
# 6      Front         4        2.2        110 5200         2565              No               16.4
#   Passengers Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight  Origin
# 1          5    177       102    68          37           26.5           11   2705 non-USA
# 2          5    195       115    71          38           30.0           15   3560 non-USA
# 3          5    180       102    67          37           28.0           14   3375 non-USA
# 4          6    193       106    70          37           31.0           17   3405 non-USA
# 5          4    186       109    69          39           27.0           13   3640 non-USA
# 6          6    189       105    69          41           28.0           16   2880     USA
#            Make
# 1 Acura Integra
# 2  Acura Legend
# 3       Audi 90
# 4      Audi 100
# 5      BMW 535i
# 6 Buick Century
?Cars93
tapply(X = Cars93$Min.Price, INDEX = Cars93$Type, FUN = min)
# Compact   Large Midsize   Small  Sporty     Van 
#     8.5    17.5    12.4     6.7     9.1    13.6
tapply(X = Cars93$Horsepower, INDEX = Cars93$Type, FUN = mean)
#  Compact    Large  Midsize    Small   Sporty      Van 
# 131.0000 179.4545 173.0909  91.0000 160.1429 149.4444

table(Cars93$Origin)
# 
#     USA non-USA 
#      48      45
us_cars <- Cars93[Cars93$Origin == "USA", ]
nonus_cars <- Cars93[Cars93$Origin != "USA", ]
# write.csv(us_cars, file = "us_cars.csv")
# save(nonus_cars, file = "nonus_cars.RData")

4 Step 4: Plotting data

4.1 Graphics in R

  • powerful environment for visualizing scientific data
  • integrated graphics AND statistics
  • publication-ready quality
  • fully programmable, highly reproducible

Many ways for the same task:

  • base graphics (plot)
  • ggplot2
  • lattice
  • interactive visualizations such as plotly, ggvis or other libraries

Why bother plotting at all?

  • facilitate comparisons
  • identify trends
  • generate hypotheses

4.2 The plot function

First thing: take a look at the overview documentation of plot

?plot
# Help on topic 'plot' was found in the following packages:
# 
#   Package               Library
#   base                  /Library/Frameworks/R.framework/Resources/library
#   graphics              /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
# 
# 
# Using the first match ...

We will see

  • scatter plots
  • boxplots
  • barplots
  • histograms

4.3 plot parameters

Required:

  • x variable
  • y variable

Other options

  • title with main
  • axes labels with xlab and ylab
  • axes limits with xlim and ylim
  • symbols, colors and sizes: pch, col and cex - as atomic elements or as vectors

4.4 Get to know the data: mpg

library(ggplot2) # this is useful per se, and contains the dataset we will be using
?mpg
This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov
# works on RStudio
# View(mpg)
# otherwise stick to the classic
str(mpg)
# tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
#  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
#  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
#  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
#  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
#  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
#  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
#  $ drv         : chr [1:234] "f" "f" "f" "f" ...
#  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
#  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
#  $ fl          : chr [1:234] "p" "p" "p" "p" ...
#  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

Make a guess: what do you expect to see between fuel consumption and engine size?

4.5 Scatter plots

plot(mpg$displ, mpg$cty)

Bonus: what is the correlation?

cor(mpg$displ, mpg$cty)
# [1] -0.798524
cor(mpg$displ, mpg$cty, method = "spearman")
# [1] -0.8809049

4.5.1 Can we do more?

mpg$mygroup <- as.numeric(factor(mpg$class))
plot(mpg$displ, mpg$cty,
  col = mpg$mygroup
)
legend("topright", legend = levels(factor(mpg$class)), col = levels(factor(mpg$mygroup)), pch = 1)

plot(mpg$displ, mpg$cty,
  pch = as.numeric(factor((mpg$class)))
)

This shows we have quite some overlap of points. What can we do?

Adding some jitter…

plot(
  x = mpg$displ + rnorm(nrow(mpg), mean = 0, sd = 0.01),
  y = mpg$cty + rnorm(rnorm(nrow(mpg), mean = 0, sd = 0.01)),
  col = mpg$mygroup,
  main = "now with jitter!"
)

Adding a smoothing line

Trying to see a pattern? Add a smoothing curve.

This one is wrong - missing the reordering of points

plot(mpg$displ, mpg$cty, col = mpg$mygroup)
myloess <- loess(cty ~ displ, data = mpg)
myfit <- fitted(myloess)
lines(mpg$displ, myfit)
legend("topright", legend = levels(factor(mpg$class)), col = levels(factor(mpg$mygroup)), pch = 1)

This one is correct!

plot(mpg$displ, mpg$cty, col = mpg$mygroup)
myloess <- loess(cty ~ displ, data = mpg)
myfit <- fitted(myloess)
myord <- order(mpg$displ)
lines(mpg$displ[myord], myfit[myord])
legend("topright", legend = levels(factor(mpg$class)), col = levels(factor(mpg$mygroup)), pch = 1)

lines can add (almost) anything (any line).

points works in a similar way to superimpose, well, points

4.6 Bar charts

?barplot
academia_levels <- table(surveyrbioc$Q2)
barplot(academia_levels)

4.7 Boxplots

How is the age distributed across academic levels? Check the help of boxplot

  • A formula is required!
  • Don’t worry, it’s nothing but your y~x variables - ok, it can get more complicated
boxplot(Q1 ~ Q2,
  data = surveyrbioc
)

Splitting on more factors

boxplot(Q1 ~ Q2 + Q3,
  data = surveyrbioc
)

Making it more readable…

boxplot(Q1 ~ Q2 + Q3,
  data = surveyrbioc,
  las = 2
)

Changing the parameters allows you to control many aspects on plot appearance par is your best friend - and enemy (see ?par)

par(mar = c(15, 3, 2, 2))
boxplot(Q1 ~ Q2 + Q3, data = surveyrbioc, las = 2)

par( ... ) has many arguments; here, the useful/most used ones

  • mar for handling the margins
  • cex, col, pch and co. are all parameters of par
  • las to change the style of the axis labels
  • mfrow to draw an array of figures

4.8 Histograms

hist(surveyrbioc$Q1, breaks = 8)

4.9 More histograms!

hist(mpg$cty, breaks = 10)

hist(mpg$cty, breaks = 10, col = "steelblue")

hist(mpg$cty, breaks = 10, col = "steelblue", border = "gray")

hist(mpg$cty, breaks = 10, col = "steelblue", border = "gray", main = "Distribution of miles/gallon consumption in city traffic")

4.10 How to do nice pie charts

DON’T.

If you really need to do it…

?pie
example(pie) # expecially the last one
# 
# pie> require(grDevices)
# 
# pie> pie(rep(1, 24), col = rainbow(24), radius = 0.9)

# 
# pie> pie.sales <- c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)
# 
# pie> names(pie.sales) <- c("Blueberry", "Cherry",
# pie+     "Apple", "Boston Cream", "Other", "Vanilla Cream")
# 
# pie> pie(pie.sales) # default colours

# 
# pie> pie(pie.sales, col = c("purple", "violetred1", "green3",
# pie+                        "cornsilk", "cyan", "white"))

# 
# pie> pie(pie.sales, col = gray(seq(0.4, 1.0, length.out = 6)))

# 
# pie> pie(pie.sales, density = 10, angle = 15 + 10 * 1:6)

# 
# pie> pie(pie.sales, clockwise = TRUE, main = "pie(*, clockwise = TRUE)")

# 
# pie> segments(0, 0, 0, 1, col = "red", lwd = 2)
# 
# pie> text(0, 1, "init.angle = 90", col = "red")
# 
# pie> n <- 200
# 
# pie> pie(rep(1, n), labels = "", col = rainbow(n), border = NA,
# pie+     main = "pie(*, labels=\"\", col=rainbow(n), border=NA,..")

# 
# pie> ## Another case showing pie() is rather fun than science:
# pie> ## (original by FinalBackwardsGlance on http://imgur.com/gallery/wWrpU4X)
# pie> pie(c(Sky = 78, "Sunny side of pyramid" = 17, "Shady side of pyramid" = 5),
# pie+     init.angle = 315, col = c("deepskyblue", "yellow", "yellow3"), border = FALSE)

pie(c(20, 80),
  init.angle = -40,
  col = c("white", "yellow"),
  label = c("no pacman", "pacman"),
  border = "lightgrey"
)

… or switch from pie to waffle (seriously)

4.11 How to do 3D exploded pie charts

DON’T. And this time I mean it

sadly enough there would be packages for this, too

4.12 Extra: dynamite plots

a.k.a. Why is this bad?

age_by_group <- tapply(surveyrbioc$Q1, surveyrbioc$Q2, mean)
sd_by_group <- tapply(surveyrbioc$Q1, surveyrbioc$Q2, sd)
mybar <- barplot(age_by_group, col = c("khaki", "salmon", "firebrick"), ylim = c(0, max(age_by_group) + 5))
# mybar, inspect it
arrows(mybar, age_by_group, mybar, (age_by_group + sd_by_group), length = 0.15, angle = 90)

Dynamite plots VS boxplots

boxplot(Q1 ~ Q2,
  data = surveyrbioc
)

Median VS distribution VS actual points… What do you really want to show?

4.13 What can you do more with your plot?

  • change the points type - see type in ?plot
  • use log scales - see log
  • annotate (some of) the points - with text
  • change font sizes, styles and so on
  • use special characters with expression
  • save the plot
    • use the point-and-click interface in RStudio
    • code it

4.13.1 Saving your plots

General code structure for this

opendevice()
...
code for the plot
...
closedevice()
pdf("myfilename.pdf")
# see also alternatives:
## png()
## jpeg()
plot(mpg$displ, mpg$cty,
  col = mpg$mygroup
)
dev.off()

4.14 Petals and sepals

4.15 Exercise session 5

Back to the iris. Three species are there. Explore the dataset in the following ways:

  • draw a histogram of the petal length. What do you see?

  • plot sepal length versus petal length. Add different colors to highlight the species

  • do the same for sepal width and sepal length, and this time use a different symbol for the species. Add a legend and a title if you want

  • (harder) calculate the mean values of each feature for each species, organizing it in a matrix where the rows are the species names. Generate a stacked bar plot with it, and another one where the bars are arranged horizontally

  • feel free to go back to the survey data and explore it further!

4.15.1 Exercise Session 5 - Solutions

Click on this to display the solution
hist(iris$Petal.Length)

plot(iris$Sepal.Length, iris$Petal.Length)

plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Species)

plot(iris$Sepal.Width, iris$Sepal.Length, pch = as.numeric(factor(iris$Species)))
legend("topright", legend = levels(factor(iris$Species)), pch = unique(factor(iris$Species)))


sl_means <- tapply(iris$Sepal.Length, iris$Species, mean)
pl_means <- tapply(iris$Petal.Length, iris$Species, mean)
sw_means <- tapply(iris$Sepal.Width, iris$Species, mean)
pw_means <- tapply(iris$Petal.Width, iris$Species, mean)
mymat <- cbind(sl_means, pl_means, sw_means, pw_means)
barplot(mymat, legend.text = unique(iris$Species))

barplot(mymat, beside = TRUE, legend.text = unique(iris$Species))

4.16 Something cool to have an overview…

pairs(iris[, 1:4], col = iris$Species)

You can use the panels even more cleverly, check the help of pairs!

This is a collection on graphs in R - with the underlying code too.

http://shiny.stat.ubc.ca/r-graph-catalog/

4.17 The gapminder project

Sit and relax/enjoy…

https://www.youtube.com/watch?v=hVimVzgtD6w

4.18 Meet ggplot2

But first, meet the gapminder data

library(gapminder)
head(gapminder)
# # A tibble: 6 × 6
#   country     continent  year lifeExp      pop gdpPercap
#   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
# 1 Afghanistan Asia       1952    28.8  8425333      779.
# 2 Afghanistan Asia       1957    30.3  9240934      821.
# 3 Afghanistan Asia       1962    32.0 10267083      853.
# 4 Afghanistan Asia       1967    34.0 11537966      836.
# 5 Afghanistan Asia       1972    36.1 13079460      740.
# 6 Afghanistan Asia       1977    38.4 14880372      786.
head(country_colors)
#          Nigeria            Egypt         Ethiopia Congo, Dem. Rep.     South Africa 
#        "#7F3B08"        "#833D07"        "#873F07"        "#8B4107"        "#8F4407" 
#            Sudan 
#        "#934607"
head(continent_colors)
#    Africa  Americas      Asia    Europe   Oceania 
# "#7F3B08" "#A50026" "#40004B" "#276419" "#313695"

Variables:

  • country
  • continent
  • year
  • lifeExp, life expectancy at birth
  • pop, total population
  • gdpPercap, per-capita GDP

4.19 The ggplot2 philosophy

gg stands for the grammar of graphics

  • you provide the data
  • you map the data to aesthetics (shape, size, colour)
  • you add geoms to specify how you want to have the data plotted
  • you can have statistical transformations
  • facets allow you to do quick elegant multi plots

It can come across somewhat harder since

  • data need to be tidy - one observation per row
  • requires an extra step for abstraction

yet, it makes the whole process of “thinking data” more natural.

4.19.1 A quick dive into the many options

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

We can store ggplot plot objects into a variable - and build upon that later

p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
p + geom_point()

p + geom_point() + scale_x_log10()

p <- p + scale_x_log10()
p + geom_point(color = "blue")

p + geom_point(color = "steelblue", pch = 19, size = 8, alpha = 1 / 4)

p + geom_point(aes(color = continent))

p + geom_point(aes(col = continent), size = 4)

p + geom_point(aes(col = continent, size = pop))

p + geom_point(aes(col = continent, size = pop)) + geom_smooth()

niceone <- p + geom_point(aes(col = continent, size = pop)) +
  geom_smooth(aes(col = continent), se = FALSE)
niceone

p + geom_point(aes(col = continent, size = pop)) +
  geom_smooth(lwd = 2, se = FALSE, method = "lm", col = "red")

p + geom_point(aes(col = continent, size = pop)) +
  geom_smooth(aes(col = continent), lwd = 2, se = FALSE, method = "lm")

p + geom_point() + facet_wrap(~continent)

p + geom_point(aes(col = continent)) + facet_wrap(~continent)

p + geom_point(aes(col = continent)) + geom_smooth() + facet_wrap(~continent)

4.19.2 Saving the plots

Becomes de facto easier - works on “plot objects”.

ggsave(file = "myplot.png")

4.19.3 Line plots

ggplot(
  gapminder,
  aes(x = year, y = lifeExp, group = country, color = country)
) +
  geom_line(lwd = 1, show.legend = FALSE) +
  scale_color_manual(values = country_colors) +
  theme_bw() +
  theme(strip.text = element_text(size = rel(1.1)))

bp <- ggplot(
  gapminder,
  aes(x = year, y = lifeExp, group = country, color = country)
) +
  geom_line(lwd = 1, show.legend = FALSE) +
  facet_wrap(~continent) +
  scale_color_manual(values = country_colors) +
  theme(strip.text = element_text(size = rel(1.1)))
bp

One step away from making things interactive - more interesting? more accessible? more explorable?

plotly::ggplotly(bp)

4.19.4 Boxplots

# now it is a categorical x VS continuous y
p <- ggplot(gapminder, aes(x = continent, y = lifeExp))
p + geom_point()

p + geom_point(alpha = 1 / 4)

It is so easy to escape dynamite plots!

p + geom_jitter()

p + geom_jitter(aes(col = continent))

p + geom_boxplot()

p + geom_boxplot() + geom_jitter(alpha = 1 / 2)

4.19.5 Histograms

p <- ggplot(gapminder, aes(lifeExp))
p + geom_histogram()

p + geom_histogram(binwidth = 1)

Stacked histogram are much easier in this framework

p + geom_histogram(aes(color = continent))

p + geom_histogram(aes(fill = continent))

p + geom_histogram(aes(fill = continent), position = "identity")

… and so is the superimposing of more than one distribution

p + geom_histogram(aes(fill = continent), position = "identity", alpha = 0.4)

Similar to histogram, you can use also density plots

p + geom_density(aes(fill = continent), alpha = 1 / 4)

4.19.6 Themes: a quick way to put a new shirt on

niceone + theme_bw()

niceone + theme_void()

If you really really really have to…

library("ggthemes")
niceone + theme_excel() + scale_color_excel()

4.20 Exercise session 6 - Homework if you want

  • try to recreate the plots you did with base graphics, this time using ggplot2

  • pick a nice plot you would like to have in your next manuscript: can you think of what you need to do it? I am talking of

    • what data type?
    • what transformations?
    • what plot type/layer?
Click on this to display the solution

5 Bonus Step: reproducible reports

Our aims:

  • Understand what R Markdown is and why you should use it
  • Learn how to construct an R Markdown file
  • Export an R Markdown file into many file formats
  • \(\rightarrow\) You are all set to use Rmd to document any of your analyses!

5.1 Reproducible reports with R Markdown

R Markdown allows you to create documents that serve as a neat record of your analysis.

Why?

  • we want other researchers to easily understand what we did in our analysis, otherwise nobody can be certain that you analysed your data properly (yay, reproducible research!)
  • create an R markdown document as an appendix to a paper or project assignment, upload it to an online repository such as Github, or simply to keep as a personal record (future you will thank present you for this)

The key point is…

R Markdown documents present your code alongside its output (graphs, tables, etc.) with conventional text to explain it, a bit like a notebook. To do this, R Markdown uses markdown syntax.

5.2 Markdown

Markdown is a very simple markup language which provides methods for creating documents with headers, images, links etc. from plain text files, while keeping the original plain text file easy to read.

You can convert Markdown documents to many other file types like .html or .pdf to display the headers, images etc..

It might sound complicated. But really isn’t,

First things first: install the required software

  • R and RStudio (guess you have it already)
install.packages("rmarkdown")
library(rmarkdown)
  • knitr comes along, pandoc too. You should quickly be all set!

5.3 Basics of markdown

ABC here, let’s go through it:

http://rmarkdown.rstudio.com/authoring_basics.html

plus… a beautiful cheat sheet is there for you!

http://rmarkdown.rstudio.com/lesson-15.html

http://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf

Using LaTeX? No problem, you can use \(\LaTeX\) here as well!

\(\left( f(x) = \sum_{i=0}^{n} \frac{a_i}{1+x} \right)\)

What can you do with R markdown?

http://rmarkdown.rstudio.com/gallery.html

http://rmarkdown.rstudio.com/formats.html

5.4 Let’s create one together!

5.4.1 Why is Rmd better than R

The price to pay to have an Rmd document is sooooo small - and for that, you get

  • code, text, output all together
  • one file only - no need to get lost
  • it even looks nice :)

5.4.2 Create an R Markdown file

To create a new R Markdown file (.Rmd), select File -> New File -> R Markdown... in RStudio, then choose the file type you want to create.

The newly created .Rmd file comes with basic instructions but we want to create our own R Markdown script, so let’s get to know the different parts of an Rmd file

  • An (optional) YAML header surrounded by ---s
  • R code chunks surrounded by backticks (```)
  • text mixed with simple text formatting

5.4.3 Inserting figures

Uh, you can insert figures also like this

![](images/grcat.png)

5.5 Insert text and code - any text, any code


```r
n <- 10
rnorm(n)
#  [1] -0.14014756  0.80533395 -0.52655265 -0.20807110  0.97189242  1.69874146 -0.99006732  0.39609724
#  [9]  0.90447946  0.04883601
```

Shortcut: Ctrl + Alt + I

Input code: you can use multiple languages including R, Python, and SQL, many more (specify the language in the chunk options)

Inline code can be added with `r 1+1`

5.5.1 Chunk options

Deatiled very nicely here: https://yihui.name/knitr/options/

A simple set of options which you can use for many documents:

set.seed(42)
knitr::opts_chunk$set(
  comment = NA,
  fig.align = "center",
  fig.width = 7,
  fig.height = 7,
  warning = FALSE,
  eval = TRUE
)

5.5.2 Knit!

Use the Knit button in the RStudio IDE to render the file and preview the output with a single click or keyboard shortcut (Ctrl + Shift + K).

To generate a report from the file, run the render command (works also outside of RStudio):

library("rmarkdown")
rmarkdown::render("yourfile.Rmd")

It was a deep dive, but now…

  • You are familiar with the Markdown syntax and code chunk rules.
  • You can include figures and tables in your Markdown reports.
  • You can create R Markdown files and export them to pdf or html files.

5.7 Exercise session Bonus

  • create a new R Markdown document
  • can you find out how to generate a word document as output?
  • insert some code you previously used for exploring the small survey data - remember, a fresh session is run when knitting, so you need the commands from the very start!

5.7.1 Exercise Session Bonus - Solutions

Click on this to display the solution
  • File -> New File -> R Markdown... in RStudio
  • add this in the yaml header
output:
  word_document

Useful material

Made @IMBEI

Books

Courses

Misc

Session Info

As a lesson, do take note/report the versions of R/the different packages you used for your analysis.

Why?

Click on this to display the solution
  • You will (should) report these in your Material & Methods
  • You credit the work of others - as a direct consequence
  • Different package versions might lead to different results
  • It is a good scientific research practice!
  • It helps others help you - in case you want to post your question on StackOverflow or similar
  • Because one command just does it for you
sessionInfo()
# R version 4.3.0 (2023-04-21)
# Platform: x86_64-apple-darwin20 (64-bit)
# Running under: macOS Monterey 12.7.1
# 
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
# LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# time zone: Europe/Berlin
# tzcode source: internal
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] rmarkdown_2.25  ggthemes_5.0.0  gapminder_1.0.0 ggplot2_3.4.4   MASS_7.3-60.0.1
# 
# loaded via a namespace (and not attached):
#  [1] tidyr_1.3.1       plotly_4.10.4     rappdirs_0.3.3    sass_0.4.8        utf8_1.2.4       
#  [6] generics_0.1.3    icons_0.2.0       xml2_1.3.6        lattice_0.22-5    stringi_1.8.3    
# [11] digest_0.6.34     magrittr_2.0.3    evaluate_0.23     grid_4.3.0        timechange_0.3.0 
# [16] bookdown_0.37     fastmap_1.1.1     Matrix_1.6-5      jsonlite_1.8.8    formatR_1.14     
# [21] httr_1.4.7        mgcv_1.9-1        purrr_1.0.2       fansi_1.0.6       crosstalk_1.2.1  
# [26] viridisLite_0.4.2 scales_1.3.0      lazyeval_0.2.2    jquerylib_0.1.4   cli_3.6.2        
# [31] rlang_1.1.3       crayon_1.5.2      ellipsis_0.3.2    splines_4.3.0     munsell_0.5.0    
# [36] withr_3.0.0       cachem_1.0.8      yaml_2.3.8        tools_4.3.0       dplyr_1.1.4      
# [41] colorspace_2.1-0  assertthat_0.2.1  vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
# [46] lubridate_1.9.3   stringr_1.5.1     htmlwidgets_1.6.4 pkgconfig_2.0.3   bslib_0.6.1      
# [51] pillar_1.9.0      gtable_0.3.4      data.table_1.15.0 glue_1.7.0        xfun_0.41        
# [56] tibble_3.2.1      tidyselect_1.2.0  highr_0.10        emo_0.0.0.9000    rstudioapi_0.15.0
# [61] knitr_1.45        farver_2.1.1      nlme_3.1-164      htmltools_0.5.7   labeling_0.4.3   
# [66] compiler_4.3.0
---
title: >
  Introduction to R: first steps, statistics, graphics
subtitle: >
  CTH Course Series on Statistics - 2024
  <p align="center">
  <a href="https://www.unimedizin-mainz.de/imbei/">
  IMBEI @ <img src="images/combilogo.png" alt="" height="70"/>
  <!-- <img src="images/ticardio-logo.png" alt="" height="50"/> -->
  </a>
  </p>
  <br>
author:
- name: Najla Abassi (abassina@uni-mainz.de)<br><a href="https://www.unimedizin-mainz.de/imbei/">IMBEI, University Medical Center Mainz</a><br>
- name: Annekathrin Nedwed (anneludt@uni-mainz.de)<br><a href="https://www.unimedizin-mainz.de/imbei/">IMBEI, University Medical Center Mainz</a><br>
- name: <a href="https://federicomarini.github.io">Federico Marini (marinif@uni-mainz.de)</a><br><a href="https://www.unimedizin-mainz.de/imbei/">IMBEI, University Medical Center Mainz</a><br><a href="https://twitter.com/FedeBioinfo">`r icons::fontawesome('twitter')` `@FedeBioinfo`</a>
date: "2024/02/21"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    theme: cosmo
    code_folding: show
    code_download: true
editor_options: 
  chunk_output_type: inline
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#",
  error = FALSE,
  warning = FALSE,
  message = FALSE
)
options(width = 100)
```

This document contains a comprehensive workflow for the course **Introduction to R: first steps, statistics, graphics**, offered in the context of the CTH Course Series on Statistics - 2024.

The content will follow closely the slides used during the course itself, with particular highlighting regarding the code and its output. This should allow you to replicate all the steps, constituting a starting base for extending the commands e.g. to your own datasets.


# Table of contents {.unnumbered}

- Step 0: First things first: know your tools, R & RStudio
- Step 1: Basics of R, RStudio, help, packages
- Step 2: Data in, data out
- Step 3: Analyzing (tabular) data: describe, explore, transform, summarise data
- Step 4: Plotting data
- Bonus Step: reproducible reports, i.e. escaping "the R script"

**Schedule (tentatively)**

9:00-10:00 --- Step 0 & Bonus: How to make the most of this
10:00-12:30 --- Step 1, 2, 3 (with short breaks in between)  
12:30-13:30 --- *Lunch break*  
13:30-15:30 --- Step 3, 4, Q&As  


# Step 0: You, R and RStudio, knowing you and knowing your tools {#step0 .unnumbered}

Let's get to know each other a bit...

## What is R? Why should I use R? 

Available at `https://www.r-project.org/`

- *free* statistical environment for interactive use
- intepreted, functional scripting/programming language - you type in, you see output
- descends from the S language, written by statisticians for statisticians
  
What can you do with R?

- Anything!
  - do calculations
  - write functions
  - analyse data. ALL the data. Well, almost. But really, almost.
  - apply advanced statistical techniques
  - do beautiful & publication-ready plots
  - develop interactive web-applications
  - presentations & documents (this one)

Why should I use R?

- it works! And it is quite powerful
- it is free, open-source, and available for all OS
- you can *really* do whatever you might aim to do in terms of statistics
- it offers awesome possibilities for (interactive) graphing
- it has a wide, active and competent community (ok, communities: statistics, bioinformatics, machine learning, ...)
- it can be extended with packages. More power!
- escape Point-and-Click-land, you work with syntax: you can use, re-use elements, validate & reproduce analysis

Why should I *not* use R?

- you use it or lose it
- the learning curve might be steep
- can be frustrating if you have errors
- help might be available, but it is very technical
- many packages, a blessing and a curse: how many, how good 

## Let's get started!

Get R - and RStudio

- go to https://cloud.r-project.org and get the latest version
- go to https://posit.co/products/open-source/rstudio/ and download RStudio
- ... or use your text editor of choice

Alternatives: 

- OpenAnalytics Architect (https://www.getarchitect.io/)
- Visual Studio Code (https://code.visualstudio.com/)
- Emacs Speaks Statistics...


## First ride: look around you

Open up RStudio - You'll have four panes

- the Source for your scripts and documents (top-left, in the default layout)
- your Environment/History (top-right),
- your Files/Plots/Packages/Help/Viewer (bottom-right), and
- the R Console (bottom-left).

Want to customize this?

Tools $\rightarrow$ Global Options $\rightarrow$ Pane Layout

Advantages of an IDE

- all in one window!
- keyboard shortcuts, autocompletion, highlighting $\rightarrow$ type easier, do less errors


## Folder structure

It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the working directory.

*Why?*

* Easier to have "self-contained" research units!
* A project "does not interfere" with other projects
* Gives a structure, easier to find things, use, reuse
* Someone else (including future you) can understand what goes on

*How?*

RStudio projects!  
Custom settings, per project.

Let's create one live now - and have the workspace NOT saved

*What is the best structure?*

One, used consistently - not gonna touch on naming things as it can get hot quickly :)

With R...

`dir.create()`

`file.edit()`


*Where am I doing things?*

The working directory is the place from where R will be looking for and saving the files. 

`getwd()`/`setwd()`, but not in your scripts (fails on others' computers!)

## Interacting with R

Instructions, commands.

Scripts, console - use the editor and have a complete record on what you did!

Shortcuts FTW!

Even better: Reproducible documents, with R Markdown

Nice resources on top:
* RStudio cheatsheet about the RStudio IDE!
* the internet/rstats community!


## Seeking help

* ?
* ??
* built-in RStudio help interface - and shortcuts!


### Where to ask for help?

* your neighbour - if COVID-conform, do interact within each other!
* your colleagues
* https://rdocumentation.org/ website 
* the web: google, StackOverflow

The main point: describe well your problem, "help others help you"

Others need to reproduce your error to help you better: `saveRDS()`, `dput()`, `sessionInfo()`


## R packages

R packages...

- are fundamental components of R ecosystem
- extend base R functionality for a specific purpose 
- bundle new functions, data sets, and documentation
- are contributed by independent developers
- have dependency management

Repositories:

- CRAN: Managed official package repository network 
- Bioconductor: curated bioinformatics packages (vignettes mandatory, integrated ecosystem!)
- GitHub: un-managed, bleeding edge - but also excellent ones ("just not on CRAN")


Our contributions so far:

- flowcatchR `https://bioconductor.org/packages/flowcatchR/`
- pcaExplorer `https://bioconductor.org/packages/pcaExplorer/`
- ideal `https://bioconductor.org/packages/ideal/`
- GeneTonic (`https://bioconductor.org/packages/GeneTonic`)
- iSEE (`https://bioconductor.org/packages/iSEE`)
- simrec (for simulating recurring events, https://cran.r-project.org/web/packages/simrec/)


###  How to use packages

- **Install**: once (for every major R version)
- **Load**: in each session
- **Use** like any base R functionality

For Bioconductor packages...

```{r, eval=FALSE}
install.packages("BiocManager")
```

```{r, eval=FALSE}
library("BiocManager")
BiocManager::install()
```


Relevant commands: 

- `install.packages("packagename")` - check it online at CRAN!
- `installed.packages()`
- `.libPaths()`
- `update.packages()`
- `library("packagename")`
- `help(package="packagename")`, `data()`, `browseVignettes()`, `vignette()`, `citation("packagename")`

Something you might have already done, if you worked on RNA-seq data:

```{r eval=FALSE}
BiocManager::install("SummarizedExperiment")
BiocManager::install("DESeq2")
```

## How should I use this material?

Use this very same text document and expand this!

Why? Let's have a look first into the Bonus Content, actually: we navigate to Section \@ref(reproreports) and continue from there.


# Step 1: Introduction to R {#step1}

Here we will touch on the first commands in R, so that you can

* Define the following terms as they relate to R: object, assign, call, function, arguments, options.
* Assign values to objects in R.
* Learn how to name objects
* Use comments to inform script.
* Solve simple arithmetic operations in R.
* Call functions and use arguments to change their default options.
* Inspect the content of vectors and manipulate their content.
* Subset and extract values from vectors.
* Analyze vectors with missing data.


## R is a powerful calculator

... but not just that.

Type the following

```{r}
2 + 2
log(2)
347 * 73841
7 / 2
7 %/% 2
7 %% 2
```

What did these commands do?

## `r emo::ji("notes")` Help! `r emo::ji("notes")`

```{r}
# this calls the help for a function to plot a histogram
?hist
# this is just the same
help(hist) ## what about ??
```

```{r}
?apropos
apropos("row")
```

- integrated help system, with executable examples
- (for some packages) vignettes (typical problem, commands, and workflow)
- CRAN Task Views: https://cran.r-project.org/web/views/
- Books!
- Courses!
- Online: mailing lists, forums (StackOverflow, ...), blogs, Twitter (`#rstats`)


## Your starting vocabulary - a.k.a. Exercise Session 0

- `getwd()` and `setwd()` - Tab is your friend
- ` <- `, ` = `: the assignment operator
- `ls()`, `rm()`
- `str()`
- `example()`, `help()`/`?[function]`
- `print()`
- `q()`/`quit()`
- logical operators: `TRUE`,`FALSE`,`!`,`==`,`!=`,`<`,`>`,`<=`,`>=`,`|`,`&`,`xor()`
- `c()`
- data have help items too: e.g. `cars`

Find out what these do!

### Exercise Session 0 - Solutions

<details>
<summary>Click on this to display the solution</summary>

```{r}
?getwd
?setwd
?`<-`
help(ls)
help(rm)
?str
?example
help(help)
?print
help(quit)
?c
?cars
```

</details>

## Make your life easier - Notes for your future self

- add comments and document your own code

```{r}
# This is a comment
```

- write clean code - use spaces, indentation
- use an editor with syntax highlighting/some form of autocompletion 

Careful here:

- R is case sensitive and has zero-tolerance with mis-spelled names
- parenthesis: open *and* close them
- special attention with missing values, factors VS strings: R is clever, but you might think differently
- do not be stingy with parentheses - if this helps you
- same goes with comments - your colleagues and your future self will thank you


## Exercise session 1

First things first:

* Grab some mini-postit (for those of you participating in person)!
* Get ready to use the chat & the reactions 

Then:

- find out more about the `iris` dataset. What is it about at all? How many variables are included? How many observations?
- `rep`licate! find out a function that `rep`licates elements of a vector to produce this

```
1 1 1 1
```

BONUS: ... and this 
 
```
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
```

### Exercise Session 1 - Solutions

<details>
<summary>Click on this to display the solution</summary>

```{r}
rep(1, 4)

rep(c(1, 2, 3, 4, 5), 3)
```

</details>

## Data types 

R can recognize different general types of data

- numbers (`numeric`)
- character strings (text)
- logical (e.g. `class(TRUE)`)
- factors ("integers with a set of labels") - it is categorical data!

- special ones: dates, time, ...
 
When and where to use which? 


## I was curious about the participants

In a previous edition of a similar course, I asked the future participants:

- How old are you?
- What is your current academic level? (PI, Postdoc, PhD, master)
- What is your current knowledge level of R? (pro-good-intermediate-poor-none)
- What is your knowledge of programming languages in general?
- What is your experience level with genomics and RNA-seq data?
- How familiar are you with mogon and parallel computing? (I am a regular user/Once in a while i used it/I know it exists/I heard we had some servers around/Is this supposed to be in the cloud?!)
- What are your expectations from the course? 


# Step 2: Data in, data out {#step2}

## Importing data in R

80-20? 90-10? Import, clean, prepare, transform your data

Sources:

- Files, Clipboard, URL
- **Plain text file: Comma-separated, tab-delimited, ...**
- R format file
- SAS / Stata / SPSS file: package `haven`
- Spreadsheet (Excel): package `readxl` - highly recommended!
- Database: RSQLite, RPostgreSQL, RMySQL, ...


## The vocabulary of importing

... and exporting

- `read.table()`,`write.table()` + `read.csv|delim`
- the option `stringsAsFactors=FALSE`
- `load()`,`save()`/`readRDS()`,`saveRDS()`
- via `haven` : `read_sas()`,`read_spss()` /`write_sas()`,`write_sav()`
- via `readxl`: `read_excel()`

Check out their documentation pages!

Other options: `rio`, RStudio GUI



## Take a look at the data 

Go to https://github.com/federicomarini/rbioc2016

$\rightarrow$ `inst/extdata`

$\rightarrow$ `survey_responses.csv`, in its raw format

You can load it directly like this

```{r}
surveyrbioc <- read.csv("https://raw.githubusercontent.com/federicomarini/rbioc2016/master/inst/extdata/survey_responses.csv")
```

Or install the package and load it from there

```{r eval=FALSE}
library("devtools")
install_github("federicomarini/rbioc2016")
library("rbioc2016")
data(surveyrbioc)
```


## Input data: Step by step, by hand?

Sometimes your data is either small and/or not in an Excel-like tabular format.

What to do? You `c`ombine the elements together!

```{r eval=TRUE}
Q1 <- c(28, 27, 33, 32, 29)
# should return this
Q1

Q2 <- c("PhD student", "PhD student", "Postdoc", "PhD student", "PhD student")
Q2
# ... and so on
```


## Combine the variables to a matrix

We have seen `c()`. We also have

- `cbind`
- `rbind`

```{r, eval=TRUE}
firstTwo <- cbind(Q1, Q2)
firstTwo
```

```{r}
rbind(Q1, Q2)
```

Is this what you wanted?

## Applying the first functions

But first, what can you do on these objects?

```{r error=TRUE}
sum(Q1)
sum(Q2)
summary(Q1)
summary(Q2)
str(Q1)
str(Q2)
mean(Q1)
dim(firstTwo)
firstTwo[, 1]
mean(firstTwo[, 1]) # Why, damn, why? Meet coercion
class(firstTwo)
```

## `matrix`, `data.frame` and `list`

- a `matrix` can contain one type of data - if numeric, you unleash all the matrix algebra power!
- a `data.frame` can store more types of data (one per column)
- a `list` is like a big box where you can put anything - but this is not always what you want

What is best?

Let's try with a `list`

```{r eval=TRUE}
Q3 <- c("intermediate", "poor", "good", "none", "intermediate")
mylist <- list(Q1, Q2, Q3)
mylist
```

```{r}
## access your elements with
mylist[[1]]
mylist[[1]][2]
```


How do we create a `data.frame`?

```{r eval=TRUE}
mydf <- data.frame(
  age = Q1,
  level = Q2,
  rexp = Q3
)
mydf
class(mydf$age)
```


### Exploring a `data.frame`

```{r}
mydf$age # it's all about the money :)
mydf[, 1]
names(mydf)
rownames(mydf)
dim(mydf)
nrow(mydf)
ncol(mydf)
```

```{r tidy=TRUE}
surveyrbioc <- read.csv("https://raw.githubusercontent.com/federicomarini/rbioc2016/master/inst/extdata/survey_responses.csv")
head(surveyrbioc)
tail(surveyrbioc)
names(surveyrbioc)
str(surveyrbioc)
summary(surveyrbioc)
surveyrbioc[, ]
```

Don't forget some nice built-in like the `View()` function, in RStudio - this is a nice alternative to "pure browsing in Excel".


## Exercise session 2

Using the `surveyrbioc` object:

- Calculate the mean age of the participants
- How many participants did actually take part to the survey?
- How old was the oldest participant? (`max` can be your help)
- `t`ranspose the survey data and assign it to another variable
- Change the column names of this object and save this data set as a tab-separated ASCII file
- BONUS: what was the youngest participant expecting?

### Exercise Session 2 - Solutions

<details>
<summary>Click on this to display the solution</summary>

```{r}
mean(surveyrbioc$Q1)

max(surveyrbioc$Q1)

my_transposed_survey <- t(surveyrbioc)

surveyrbioc_mod <- surveyrbioc
colnames(surveyrbioc_mod) <- c("age", "level", "rlevel", "prog_level", "genomics_level", "parcomp_level", "expectation")

surveyrbioc_mod$expectation[which.min(surveyrbioc_mod$age)]
```

</details>

# Step 3: Analyzing (tabular) data {#step3}

Describe, explore, transform, summarise data

## Exploring, subsetting, manipulating, analysing

- `dim(x)` shows the dimensions of an object
- `str(x)` provides an overview of the structure of an object and the elements it contains
- `sum(x)`, `mean(x)`, `sd(x)` computes the sum, mean, or standard deviation of all the elements in `x`; `median(x)`, `quantile(x)`
- `length(x)` returns the number of elements in x (a vector)
- `sqrt(x)`, `log(x)` take the square root and the natural logarithm of a numeric - element or vector
- `hist(x, breaks=20, col="blue")` plots a histogram of variable x with 20 bins colored blue
- `unique(x)` returns the vector of unique elements in x
- `rm(x)` removes the object x from the environment (`rm(list=ls())` removes all objects)
- `sessionInfo()` prints information about R session and versions of all attached packages
- logical operators might often come handy!


## Subsetting the data

This is the basic way it works

```{r eval=FALSE}
surveyrbioc[ROWS, COLUMNS]
```

You can subset with...

- integers
- blank spaces
- names
- logical vectors

Try to make a guess, given this vector.

```{r eval=TRUE}
vec <- c(6, 1, 3, 6, 10, 5)
```

What happens if you do this?

```{r}
vec[2]
vec[c(5, 6)]
vec[-c(5, 6)]
vec > 5
vec[vec > 5]
```


What happens if you do this?

```{r}
df <- data.frame(
  name = c("John", "Paul", "George", "Ringo"),
  birth = c(1940, 1942, 1943, 1940),
  instrument = c("guitar", "bass", "guitar", "drums")
)

df

df[c(2, 4), 3]
df[, 1]
df[, "instrument"]
df$instrument
```

Back to the survey

```{r}
# I just want the age
surveyrbioc[, 1]
# or
surveyrbioc$Q1

# the first 4 columns
surveyrbioc[, c(1, 2, 3, 4)]
surveyrbioc[, 1:4]

# all but the last column
surveyrbioc[, -7]
# if you don't know we had 7 columns...
surveyrbioc[, -ncol(surveyrbioc)]

# you can subset with logical vectors, by row and by column
surveyrbioc[c(rep(TRUE, 10), rep(FALSE, 8)), ]
surveyrbioc[c(TRUE, FALSE), ] # keep in mind this behavior!

# guess what this does?
surveyrbioc$Q2 == "PhD student"
```

## Exercise session 3

- How many PhD students did reply?
- What was the proportion of PhD students to all other participants?
- How old were they on average?
- How many of the participants were older than 30?
- How many postdocs were younger than 35?
- How many of the participants did not reply to the last question? <!-- (`is.na` is your friend) -->

### Exercise Session 3 - Solutions

<details>
<summary>Click on this to display the solution</summary>

```{r}
sum(surveyrbioc$Q2 == "PhD student")
mean(surveyrbioc$Q2 == "PhD student")
mean(surveyrbioc$Q1[surveyrbioc$Q2 == "PhD student"])
sum(surveyrbioc$Q1 >= 30)
sum(surveyrbioc$Q1 < 35 & surveyrbioc$Q2 == "Postdoc")
sum(is.na(surveyrbioc$Q7))
```

</details>

## Manipulating and analysing your data

You can

- sort the data (see `sort` and `order`)
- transform your data: apply rules (formulas, logics, insight altogether)
- combine two datasets or more (if you `merge` them)
- do some statistics on your data


## Sorting the data

```{r eval=TRUE}
myord <- order(surveyrbioc$Q1)
myord

head(surveyrbioc[myord, 1:5], 4)
sorted_surv <- surveyrbioc[myord, 1:6]
```

`sort()` returns you the sorted data, `order()` the indices only


## Transforming the data 

```{r eval=TRUE}
# transforming a variable
newsurvey <- surveyrbioc[, 1:5]
newsurvey$ageroot <- sqrt(newsurvey$Q1)
head(newsurvey)

# creating groups out of a continuous variable
newsurvey$agegroup <- cut(newsurvey$Q1, breaks = c(20, 30, 40))
head(newsurvey)
```

Use case for `merge`: you have *two* sets you are playing with! Think in advance what you need for that purpose...


## We want statistics! 

Are PhD students *significantly* younger than postdocs? Are there any differences in the age of the three groups?

```{r}
phds <- surveyrbioc[surveyrbioc$Q2 == "PhD student", ]
postdocs <- surveyrbioc[surveyrbioc$Q2 == "Postdoc", ]
t.test(phds$Q1, postdocs$Q1)
aov(data = surveyrbioc, Q1 ~ Q2) # What is missing here?
```

Much more on this: in the next courses!


## Simple yet powerful functions

`tapply`

You want to calculate the median age of each academic group in here

```{r eval=TRUE}
md <- median(surveyrbioc$Q1)
md_master <- median(surveyrbioc$Q1[surveyrbioc$Q2 == "Master student/else"])
md_phd <- median(surveyrbioc$Q1[surveyrbioc$Q2 == "PhD student"])
md_postdocs <- median(surveyrbioc$Q1[surveyrbioc$Q2 == "Postdoc"])
c(md_master, md_phd, md_postdocs)
```

`tapply` splits the data of the first variable on the levels of the second variable, and applies the function (*any* function)

```{r eval=TRUE}
tapply(X = surveyrbioc$Q1, INDEX = surveyrbioc$Q2, FUN = median)
```


`lapply` and `sapply`

Back to our `iris` dataset

```{r eval=TRUE}
names(iris)
```

We want the average sepal length and width, and the same for the petals. Uh, and we want the standard deviation too.

```{r}
# the unefficient way:
seplen_m <- mean(iris$Sepal.Length)
sepwid_m <- mean(iris$Sepal.Width)
petlen_m <- mean(iris$Petal.Length)
petwid_m <- mean(iris$Petal.Width)

seplen_m <- sd(iris$Sepal.Length)
# ... and so on
```

$\rightarrow$ Apply a Function over a List or Vector

```{r}
# we will use just the first four columns
lapply(iris[, 1:4], mean)
sapply(iris[, 1:4], mean)
lapply(iris[, 1:4], sd)
# ...
```

The major difference is in the presentation of the output

## A very good friend: get a `summary` of your data

*and some other friends...*

Try out `summary` on a `data.frame`

```{r}
summary(iris)
```

Alternatives in other packages:

- `describe()` in the `Hmisc` package
- `skim()` from `skimr`
- `create_report()` from `Data Explorer`


`table`

```{r}
table(surveyrbioc$Q3)
table(surveyrbioc$Q4)

table(surveyrbioc$Q2, surveyrbioc$Q3)
```

- want the sums? Try `addmargins()`
- looking for the percentage values? `prop.table()`
- somewhat nicer output: `ftable()`

```{r}
addmargins(table(surveyrbioc$Q2, surveyrbioc$Q3))
prop.table(table(surveyrbioc$Q2, surveyrbioc$Q3))
```


Please always do check the docs!



## Exercise session 4

The `MASS` package contains the dataset `Cars93`, which stores the data on 93 makes of car sold in US

- you'll need the package *and* the data
- `Type` specifies the type of market the car is aimed at. Find the cheapest car in each type, and the one with the greatest fuel efficiency
- compute the mean horsepower for each type
- create two `data.frame`s, one for US cars, the other one with non-US cars
- export the US cars to a text file
- save the non-US cars data to a binary file (`.RData`)

### Exercise Session 4 - Solutions

<details>
<summary>Click on this to display the solution</summary>

```{r}
library(MASS)
head(Cars93)
?Cars93
tapply(X = Cars93$Min.Price, INDEX = Cars93$Type, FUN = min)
tapply(X = Cars93$Horsepower, INDEX = Cars93$Type, FUN = mean)

table(Cars93$Origin)
us_cars <- Cars93[Cars93$Origin == "USA", ]
nonus_cars <- Cars93[Cars93$Origin != "USA", ]
# write.csv(us_cars, file = "us_cars.csv")
# save(nonus_cars, file = "nonus_cars.RData")
```

</details>

# Step 4: Plotting data {#step4}

## Graphics in R

 - powerful environment for visualizing scientific data
 - integrated graphics **AND** statistics
 - publication-ready quality
 - fully programmable, highly reproducible

Many ways for the same task:

- base graphics (`plot`)
- `ggplot2`
- `lattice`
- interactive visualizations such as `plotly`, `ggvis` or other libraries

Why bother plotting at all?

- facilitate comparisons
- identify trends
- generate hypotheses



## The `plot` function

First thing: take a look at the overview documentation of `plot`

```{r}
?plot
```

We will see

- scatter plots
- boxplots
- barplots
- histograms


## `plot` parameters 

Required:

- x variable
- y variable

Other options

- title with `main`
- axes labels with `xlab` and `ylab`
- axes limits with `xlim` and `ylim`
- symbols, colors and sizes: `pch`, `col` and `cex` - as atomic elements or as vectors


## Get to know the data: `mpg`

```{r}
library(ggplot2) # this is useful per se, and contains the dataset we will be using
?mpg
```

    This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov

```{r}
# works on RStudio
# View(mpg)
# otherwise stick to the classic
str(mpg)
```

Make a guess: what do you expect to see between fuel consumption and engine size?

## Scatter plots

```{r eval=TRUE}
plot(mpg$displ, mpg$cty)
```

Bonus: what is the `cor`relation?

```{r}
cor(mpg$displ, mpg$cty)
cor(mpg$displ, mpg$cty, method = "spearman")
```


### Can we do more?

```{r eval=TRUE}
mpg$mygroup <- as.numeric(factor(mpg$class))
plot(mpg$displ, mpg$cty,
  col = mpg$mygroup
)
legend("topright", legend = levels(factor(mpg$class)), col = levels(factor(mpg$mygroup)), pch = 1)
```


```{r eval=TRUE}
plot(mpg$displ, mpg$cty,
  pch = as.numeric(factor((mpg$class)))
)
```

This shows we have quite some overlap of points. What can we do?

Adding some jitter...

```{r eval=TRUE}
plot(
  x = mpg$displ + rnorm(nrow(mpg), mean = 0, sd = 0.01),
  y = mpg$cty + rnorm(rnorm(nrow(mpg), mean = 0, sd = 0.01)),
  col = mpg$mygroup,
  main = "now with jitter!"
)
```


Adding a smoothing line

Trying to see a pattern? Add a smoothing curve.

This one is wrong - missing the reordering of points

```{r eval=TRUE}
plot(mpg$displ, mpg$cty, col = mpg$mygroup)
myloess <- loess(cty ~ displ, data = mpg)
myfit <- fitted(myloess)
lines(mpg$displ, myfit)
legend("topright", legend = levels(factor(mpg$class)), col = levels(factor(mpg$mygroup)), pch = 1)
```

This one is correct!

```{r eval=TRUE}
plot(mpg$displ, mpg$cty, col = mpg$mygroup)
myloess <- loess(cty ~ displ, data = mpg)
myfit <- fitted(myloess)
myord <- order(mpg$displ)
lines(mpg$displ[myord], myfit[myord])
legend("topright", legend = levels(factor(mpg$class)), col = levels(factor(mpg$mygroup)), pch = 1)
```

`lines` can add (almost) anything (any line). 

`points` works in a similar way to superimpose, well, points


## Bar charts

```{r}
?barplot
```

```{r eval=TRUE}
academia_levels <- table(surveyrbioc$Q2)
barplot(academia_levels)
```


## Boxplots

How is the age distributed across academic levels? Check the help of `boxplot`

- A formula is required!
- Don't worry, it's nothing but your `y~x` variables - ok, it can get more complicated
    
```{r eval=TRUE}
boxplot(Q1 ~ Q2,
  data = surveyrbioc
)
```

Splitting on more factors

```{r eval = TRUE}
boxplot(Q1 ~ Q2 + Q3,
  data = surveyrbioc
)
```

Making it more readable...

```{r eval = TRUE}
boxplot(Q1 ~ Q2 + Q3,
  data = surveyrbioc,
  las = 2
)
```

Changing the `par`ameters allows you to control many aspects on plot appearance
`par` is your best friend - and enemy (see `?par`)

```{r eval=TRUE}
par(mar = c(15, 3, 2, 2))
boxplot(Q1 ~ Q2 + Q3, data = surveyrbioc, las = 2)
```

`par( ... )` has many arguments; here, the useful/most used ones

- `mar` for handling the margins
- `cex`, `col`, `pch` and co. are all parameters of `par`
- `las` to change the style of the axis labels
- `mfrow` to draw an array of figures


## Histograms

```{r eval=TRUE}
hist(surveyrbioc$Q1, breaks = 8)
```


## More histograms!

```{r}
hist(mpg$cty, breaks = 10)
hist(mpg$cty, breaks = 10, col = "steelblue")
hist(mpg$cty, breaks = 10, col = "steelblue", border = "gray")
hist(mpg$cty, breaks = 10, col = "steelblue", border = "gray", main = "Distribution of miles/gallon consumption in city traffic")
```


## How to do nice pie charts

DON'T.

If you **really** need to do it...

```{r}
?pie
example(pie) # expecially the last one
```

```{r}
pie(c(20, 80),
  init.angle = -40,
  col = c("white", "yellow"),
  label = c("no pacman", "pacman"),
  border = "lightgrey"
)
```

... or switch from pie to waffle (seriously)

## How to do 3D exploded pie charts

**DON'T**. And this time I mean it

*sadly enough there would be packages for this, too*



## Extra: *dynamite* plots

a.k.a. Why is this bad?

```{r eval=TRUE}
age_by_group <- tapply(surveyrbioc$Q1, surveyrbioc$Q2, mean)
sd_by_group <- tapply(surveyrbioc$Q1, surveyrbioc$Q2, sd)
mybar <- barplot(age_by_group, col = c("khaki", "salmon", "firebrick"), ylim = c(0, max(age_by_group) + 5))
# mybar, inspect it
arrows(mybar, age_by_group, mybar, (age_by_group + sd_by_group), length = 0.15, angle = 90)
```


Dynamite plots VS boxplots

```{r eval=TRUE}
boxplot(Q1 ~ Q2,
  data = surveyrbioc
)
```

Median VS distribution VS actual points... What do you really want to show?


## What can you do more with your plot?

- change the points type - see `type` in `?plot`
- use log scales - see `log`
- annotate (some of) the points - with `text`
- change font sizes, styles and so on
- use special characters with `expression`
- save the plot
  - use the point-and-click interface in RStudio
  - code it
    
### Saving your plots

General code structure for this

```
opendevice()
...
code for the plot
...
closedevice()
```

```{r eval=FALSE}
pdf("myfilename.pdf")
# see also alternatives:
## png()
## jpeg()
plot(mpg$displ, mpg$cty,
  col = mpg$mygroup
)
dev.off()
```



## Petals and sepals

<img src="images/petal-sepal.jpg" alt="" height="600"/>

## Exercise session 5

Back to the `iris`. Three species are there. Explore the dataset in the following ways:

- draw a histogram of the petal length. What do you see?
- plot sepal length versus petal length. Add different colors to highlight the species
- do the same for sepal width and sepal length, and this time use a different symbol for the species. Add a legend and a title if you want
- (harder) calculate the mean values of each feature for each species, organizing it in a matrix where the rows are the species names. Generate a stacked bar plot with it, and another one where the bars are arranged horizontally

- feel free to go back to the survey data and explore it further!

### Exercise Session 5 - Solutions

<details>
<summary>Click on this to display the solution</summary>

```{r}
hist(iris$Petal.Length)
plot(iris$Sepal.Length, iris$Petal.Length)
plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Species)
plot(iris$Sepal.Width, iris$Sepal.Length, pch = as.numeric(factor(iris$Species)))
legend("topright", legend = levels(factor(iris$Species)), pch = unique(factor(iris$Species)))

sl_means <- tapply(iris$Sepal.Length, iris$Species, mean)
pl_means <- tapply(iris$Petal.Length, iris$Species, mean)
sw_means <- tapply(iris$Sepal.Width, iris$Species, mean)
pw_means <- tapply(iris$Petal.Width, iris$Species, mean)
mymat <- cbind(sl_means, pl_means, sw_means, pw_means)
barplot(mymat, legend.text = unique(iris$Species))
barplot(mymat, beside = TRUE, legend.text = unique(iris$Species))
```

</details>

## Something cool to have an overview...

```{r eval=TRUE}
pairs(iris[, 1:4], col = iris$Species)
```

You can use the panels even more cleverly, check the help of `pairs`!

This is a collection on graphs in R - with the underlying code too.

http://shiny.stat.ubc.ca/r-graph-catalog/


## The `gapminder` project

Sit and relax/enjoy...

https://www.youtube.com/watch?v=hVimVzgtD6w

<!-- <iframe width="854" height="480" src="https://www.youtube.com/embed/hVimVzgtD6w?t=25" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe> -->

## Meet `ggplot2`

But first, meet the `gapminder` data 

```{r}
library(gapminder)
head(gapminder)
head(country_colors)
head(continent_colors)
```

Variables: 

- country 	
- continent 	
- year 	
- lifeExp, life expectancy at birth
- pop, total population 
- gdpPercap, per-capita GDP

<!-- <img src="https://raw.githubusercontent.com/jennybc/gapminder/master/data-raw/gapminder-color-scheme-ggplot2.png" alt="" height="400"/> -->

<!-- ![](https://raw.githubusercontent.com/jennybc/gapminder/master/data-raw/gapminder-color-scheme-ggplot2.png) -->


## The `ggplot2` philosophy

`gg` stands for the grammar of graphics

- you provide the `data`
- you map the data to `aes`thetics (shape, size, colour)
- you add `geom`s to specify how you want to have the data plotted
- you can have `stat`istical transformations
- `facet`s allow you to do quick elegant multi plots

It can come across somewhat harder since

- data need to be tidy - one observation per row
- requires an extra step for abstraction

yet, it makes the whole process of "thinking data" more natural.


### A quick dive into the many options


```{r eval = TRUE}
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
```

```{r eval=TRUE}
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```

We can store `ggplot` plot objects into a variable - and build upon that later 

```{r eval=TRUE}
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
p + geom_point()
```

```{r eval=TRUE}
p + geom_point() + scale_x_log10()
p <- p + scale_x_log10()
```

```{r eval=TRUE}
p + geom_point(color = "blue")
```

```{r eval=TRUE}
p + geom_point(color = "steelblue", pch = 19, size = 8, alpha = 1 / 4)
```

```{r eval=TRUE}
p + geom_point(aes(color = continent))
```

```{r eval=TRUE}
p + geom_point(aes(col = continent), size = 4)
```

```{r eval=TRUE}
p + geom_point(aes(col = continent, size = pop))
```

```{r eval=TRUE}
p + geom_point(aes(col = continent, size = pop)) + geom_smooth()
```

```{r eval=TRUE}
niceone <- p + geom_point(aes(col = continent, size = pop)) +
  geom_smooth(aes(col = continent), se = FALSE)
niceone
```

```{r eval=TRUE}
p + geom_point(aes(col = continent, size = pop)) +
  geom_smooth(lwd = 2, se = FALSE, method = "lm", col = "red")
```

```{r eval=TRUE}
p + geom_point(aes(col = continent, size = pop)) +
  geom_smooth(aes(col = continent), lwd = 2, se = FALSE, method = "lm")
```

```{r eval=TRUE}
p + geom_point() + facet_wrap(~continent)
```

```{r eval=TRUE}
p + geom_point(aes(col = continent)) + facet_wrap(~continent)
```

```{r eval=TRUE}
p + geom_point(aes(col = continent)) + geom_smooth() + facet_wrap(~continent)
```

### Saving the plots

Becomes de facto easier - works on "plot objects".

```{r eval=FALSE}
ggsave(file = "myplot.png")
```

### Line plots

```{r eval=TRUE}
ggplot(
  gapminder,
  aes(x = year, y = lifeExp, group = country, color = country)
) +
  geom_line(lwd = 1, show.legend = FALSE) +
  scale_color_manual(values = country_colors) +
  theme_bw() +
  theme(strip.text = element_text(size = rel(1.1)))
```

```{r eval=TRUE}
bp <- ggplot(
  gapminder,
  aes(x = year, y = lifeExp, group = country, color = country)
) +
  geom_line(lwd = 1, show.legend = FALSE) +
  facet_wrap(~continent) +
  scale_color_manual(values = country_colors) +
  theme(strip.text = element_text(size = rel(1.1)))
bp
```

One step away from making things interactive - more interesting? more accessible? more explorable?

```{r eval=TRUE,message=FALSE}
plotly::ggplotly(bp)
```


### Boxplots


```{r eval=TRUE}
# now it is a categorical x VS continuous y
p <- ggplot(gapminder, aes(x = continent, y = lifeExp))
```

```{r eval=TRUE}
p + geom_point()
```

```{r eval=TRUE}
p + geom_point(alpha = 1 / 4)
```

It is so easy to escape *dynamite* plots!

```{r eval=TRUE}
p + geom_jitter()
```

```{r eval=TRUE}
p + geom_jitter(aes(col = continent))
```

```{r eval=TRUE}
p + geom_boxplot()
```

```{r eval=TRUE}
p + geom_boxplot() + geom_jitter(alpha = 1 / 2)
```

### Histograms

```{r eval=TRUE}
p <- ggplot(gapminder, aes(lifeExp))
p + geom_histogram()
```

```{r eval=TRUE}
p + geom_histogram(binwidth = 1)
```

Stacked histogram are much easier in this framework

```{r eval=TRUE}
p + geom_histogram(aes(color = continent))
```

```{r eval=TRUE}
p + geom_histogram(aes(fill = continent))
```

```{r eval=TRUE}
p + geom_histogram(aes(fill = continent), position = "identity")
```

... and so is the superimposing of more than one distribution

```{r eval=TRUE}
p + geom_histogram(aes(fill = continent), position = "identity", alpha = 0.4)
```

Similar to histogram, you can use also density plots

```{r eval=TRUE}
p + geom_density(aes(fill = continent), alpha = 1 / 4)
```

### Themes: a quick way to put a new shirt on

```{r eval=TRUE}
niceone + theme_bw()
```

```{r eval=TRUE}
niceone + theme_void()
```

If you really really really have to...

```{r eval=TRUE}
library("ggthemes")
niceone + theme_excel() + scale_color_excel()
```

## Exercise session 6 - Homework if you want

- try to recreate the plots you did with base graphics, this time using `ggplot2` 

- pick a nice plot you would like to have in your next manuscript: can you think of what you need to do it? I am talking of 
    - what data type?
    - what transformations?
    - what plot type/layer?


<!-- # tabular data analysis - on the RNA dataset? -->

<!-- # data visualization -->

<!-- # joining tables (?) -->

<details>
<summary>Click on this to display the solution</summary>


</details>


# Bonus Step: reproducible reports {#reproreports}

Our aims: 

- Understand what R Markdown is and why you should use it
- Learn how to construct an R Markdown file
- Export an R Markdown file into many file formats
- $\rightarrow$ You are all set to use Rmd to document any of your analyses!

## Reproducible reports with R Markdown

R Markdown allows you to create documents that serve as a neat record of your analysis. 

Why?

- we want other researchers to easily understand what we did in our analysis, otherwise nobody can be certain that you analysed your data properly (yay, reproducible research!)
- create an R markdown document as an appendix to a paper or project assignment, upload it to an online repository such as Github, or simply to keep as a personal record (*future you* will thank *present you* for this)

The key point is...

R Markdown documents present your code alongside its output (graphs, tables, etc.) with conventional text to explain it, a bit like a notebook. To do this, R Markdown uses **markdown** syntax. 



## Markdown

Markdown is a very simple *markup* language which provides methods for creating documents with headers, images, links etc. from plain text files, while keeping the original plain text file easy to read. 

You can convert Markdown documents to many other file types like `.html` or `.pdf` to display the headers, images etc..

It might sound complicated. But *really* isn't, 

![](https://media.giphy.com/media/26BRNoQJ5bRcZS8Hm/giphy.gif)

First things first: install the required software

- R and RStudio (guess you have it already)

```{r eval=FALSE}
install.packages("rmarkdown")
```

```{r}
library(rmarkdown)
```

- `knitr` comes along, `pandoc` too. You should quickly be all set!

## Basics of markdown

ABC here, let's go through it:

http://rmarkdown.rstudio.com/authoring_basics.html

plus... a beautiful cheat sheet is there for you!

http://rmarkdown.rstudio.com/lesson-15.html

http://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf

Using LaTeX? No problem, you can use $\LaTeX$ here as well!

$\left( f(x) = \sum_{i=0}^{n} \frac{a_i}{1+x} \right)$

What can you do with R markdown?

http://rmarkdown.rstudio.com/gallery.html

http://rmarkdown.rstudio.com/formats.html


## Let's create one together!

### Why is Rmd better than R

![](https://i.imgflip.com/22u565.jpg)

The price to pay to have an Rmd document is sooooo small - and for that, you get

- code, text, output all together
- one file only - no need to get lost
- it even looks nice :)


### Create an R Markdown file 

To create a new R Markdown file (`.Rmd`), select `File -> New File -> R Markdown...` in RStudio, then choose the file type you want to create. 

The newly created `.Rmd` file comes with basic instructions but we want to create our own R Markdown script, so let's get to know the different parts of an Rmd file

- An (optional) YAML header surrounded by `---`s
- R code chunks surrounded by backticks (```)
- text mixed with simple text formatting

### Inserting figures 

Uh, you can insert figures also like this

```
![](images/grcat.png)
```

![](images/grcat.png)

## Insert text and code - any text, any code

````
```{r}
n <- 10
rnorm(n)
```
````

Shortcut: `Ctrl + Alt + I`    

Input code: you can use multiple languages including R, Python, and SQL, many more (specify the language in the chunk options)


Inline code can be added with `` `r
1+1` ``


### Chunk options

Deatiled very nicely here: https://yihui.name/knitr/options/

A simple set of options which you can use for many documents:

```{r, echo=TRUE, warning=FALSE, eval=FALSE}
set.seed(42)
knitr::opts_chunk$set(
  comment = NA,
  fig.align = "center",
  fig.width = 7,
  fig.height = 7,
  warning = FALSE,
  eval = TRUE
)
```


### Knit!

Use the `Knit` button in the RStudio IDE to render the file and preview the output with a single click or keyboard shortcut (`Ctrl + Shift + K`).

To generate a report from the file, run the `render` command (works also outside of RStudio):

```{r eval=FALSE}
library("rmarkdown")
rmarkdown::render("yourfile.Rmd")
```

It was a deep dive, but now...

- You are familiar with the Markdown syntax and code chunk rules.
- You can include figures and tables in your Markdown reports.
- You can create R Markdown files and export them to pdf or html files.


## (Much) more on R Markdown

- http://rmarkdown.rstudio.com/
- http://stat545.com/block007_first-use-rmarkdown.html
- http://rmarkdown.rstudio.com/lesson-1.html
- ... to http://rmarkdown.rstudio.com/lesson-15.html
- http://rmarkdown.rstudio.com/articles.html

You can do much much more (presentations, websites, manuscripts,...)

## Exercise session Bonus

- create a new R Markdown document
- can you find out how to generate a word document as output?
- insert some code you previously used for exploring the small survey data - remember, a fresh session is run when knitting, so you need the commands from the very start!


### Exercise Session Bonus - Solutions

<details>
<summary>Click on this to display the solution</summary>

- `File -> New File -> R Markdown...` in RStudio
- add this in the yaml header

```
output:
  word_document
```
<!---->
</details>





# Useful material {#morematerial .unnumbered}

Made @IMBEI

- https://imbeimainz.github.io/GTIPI2022/ and https://github.com/imbeimainz/GTIPI2022_materials if you want to know more on the principles of data visualization
- this very repository of course material (keep this monitored over time!): https://github.com/imbeimainz/courseware_CTH_2024


Books

- R in a nutshell, R cookbook, R graphics cookbook (@O'Reilly media)
- A Beginner's Guide to R (Zuur, Ieno, Meesters, @Springer)
- R Programming for Data Science (Peng, @Leanpub)
- R Programming for Bioinformatics (Gentleman, @CRC)
- Bioconductor Case Studies (@Springer)
- Data Analysis for the Life Sciences (Irizarry, Love, @Leanpub)
- Bioconductor - An Introduction to Core Technologies (Hansen, @Leanpub)
- R for Data Science (Wickham, Grolemund, @O'Reilly)

- the whole `Use R!` book series: https://link.springer.com/bookseries/6991
- this one from CRC: https://www.crcpress.com/Chapman--HallCRC-The-R-Series/book-series/CRCTHERSER
- https://link.springer.com/book/10.1007/978-3-662-53670-4
- https://link.springer.com/book/10.1007/978-3-662-49102-7


Courses

- https://www.datacamp.com/courses/free-introduction-to-r
- https://www.coursera.org/specializations/jhu-data-science
- https://www.edx.org/course/introduction-r-data-science-microsoft-dat204x-7

Misc

- http://r4stats.com/articles/why-r-is-hard-to-learn/
- https://www.rstudio.com/resources/cheatsheets/
- swirl - learn R in R: http://swirlstats.com/

<!-- ## What we left out -->

<!-- - advanced data manipulation - `dplyr` and the mighty pipe -->
<!-- - string manipulations -->
<!-- - ggplot2 in more detail -->
<!-- - package development (your own package) -->
<!-- - shiny apps development (from 0 to app) -->
<!-- - reproducible reports (also interactive!) -->



# Session Info {#sessioninfo .unnumbered}

As a lesson, do take note/report the versions of R/the different packages you used for your analysis.

Why? 

<details>
<summary>Click on this to display the solution</summary>

* You will (should) report these in your Material & Methods
* You credit the work of others - as a direct consequence
* Different package versions might lead to different results
* It is a good scientific research practice!
* It helps others help you - in case you want to post your question on StackOverflow or similar
* Because one command just does it for you

</details>

```{r}
sessionInfo()
```
